Zstd Compression Pipeline


ELI5

Compress is like packing groceries into a vacuum bag while the cashier prints a receipt at the same time: one stream of bytes goes into the bag (.pellet) and into the receipt-tape (sha256) simultaneously, then a label gets stuck on the bag.

Technical Deep Dive

Store.Compress(sourcePath, level) (engine/internal/pellets/pellets.go:62) builds a single byte stream:

walk(sourcePath) → tar.Writer → zstd.Encoder → io.MultiWriter[file, sha256]
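
The chain above can be sketched with the standard library alone. This is a minimal illustration, not the code in pellets.go: `compressTree` and `demo` are hypothetical names, and `compress/gzip` stands in for the zstd encoder so the example runs without third-party modules. The part that carries over is the shape: tar writes into the compressor, and the compressor writes into an io.MultiWriter that feeds the output and the hasher together.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// compressTree streams every regular file under src through
// tar -> compressor -> MultiWriter(out, sha256) and returns the hex digest
// of the compressed bytes. gzip is a stand-in for the zstd encoder.
func compressTree(src string, out io.Writer) (string, error) {
	hasher := sha256.New()
	sink := io.MultiWriter(out, hasher) // every Write reaches both targets
	enc := gzip.NewWriter(sink)
	tw := tar.NewWriter(enc)

	walkErr := filepath.Walk(src, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.Mode().IsRegular() {
			return nil // sketch only: directories and special files are skipped
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		hdr.Name = filepath.ToSlash(rel)
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
	if walkErr != nil {
		return "", walkErr
	}
	// Close order matters: the tar trailer first, then the compressor flush.
	if err := tw.Close(); err != nil {
		return "", err
	}
	if err := enc.Close(); err != nil {
		return "", err
	}
	return hex.EncodeToString(hasher.Sum(nil)), nil
}

// demo compresses a throwaway directory into memory and returns the streamed
// digest next to one recomputed over the finished bytes.
func demo() (string, string, error) {
	dir, err := os.MkdirTemp("", "pellet-src")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "a.txt"), []byte("hello"), 0o644); err != nil {
		return "", "", err
	}
	var buf bytes.Buffer
	streamed, err := compressTree(dir, &buf)
	if err != nil {
		return "", "", err
	}
	sum := sha256.Sum256(buf.Bytes())
	return streamed, hex.EncodeToString(sum[:]), nil
}

func main() {
	streamed, recomputed, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("digests match:", streamed == recomputed)
}
```

The digest is only complete once both Close calls have returned, because the compressor may hold buffered output until it is closed.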

Execution Sequence Diagram

sequenceDiagram
    participant C as Caller
    participant S as Store.Compress
    participant T as tar.Writer
    participant Z as zstd.Encoder
    participant F as os.File(.pellet)
    participant H as sha256.New
    C->>S: Compress(path, level)
    S->>S: filepath.Walk → totalSize
    S->>Z: NewWriter(level→preset)
    S->>T: NewWriter(zstd)
    loop for each file in walk
        T->>Z: tar header + bytes
        Z->>F: encoded bytes
        Z->>H: same encoded bytes
    end
    T->>Z: Close()
    Z->>F: flush
    S->>S: build Pellet{ID, ratio, checksum}
    S->>F: write .meta.json sidecar
    S-->>C: *Pellet

Level → Preset Mapping

Level range	klauspost preset
1–4	`SpeedFastest`
5–9	`SpeedDefault`
10–22	`SpeedBestCompression`
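
The table can be read as a small switch. The sketch below uses a local string type in place of the klauspost `zstd.EncoderLevel` constants so that it compiles with the standard library only. `presetForLevel` is a hypothetical name, and returning an error for levels outside 1–22 is an assumption, since the source does not say how out-of-range levels are handled.

```go
package main

import "fmt"

// preset mirrors the names in the table. In real code these would be the
// corresponding zstd.EncoderLevel values from klauspost/compress.
type preset string

const (
	speedFastest         preset = "SpeedFastest"
	speedDefault         preset = "SpeedDefault"
	speedBestCompression preset = "SpeedBestCompression"
)

// presetForLevel maps a numeric level band to a preset.
// Assumption: levels outside 1-22 are rejected rather than clamped.
func presetForLevel(level int) (preset, error) {
	switch {
	case level >= 1 && level <= 4:
		return speedFastest, nil
	case level >= 5 && level <= 9:
		return speedDefault, nil
	case level >= 10 && level <= 22:
		return speedBestCompression, nil
	default:
		return "", fmt.Errorf("level %d is outside 1-22", level)
	}
}

func main() {
	for _, lvl := range []int{3, 7, 19} {
		p, err := presetForLevel(lvl)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("level %d -> %s\n", lvl, p)
	}
}
```

Because the mapping works in bands, every level within a band selects the same encoder mode: under this table, levels 10 and 22 behave identically.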

Extraction Side

Store.Extract(id, target) reads the .pellet file through zstd.Decoder → tar.Reader. Each header is checked: the entry name is joined onto target, and if filepath.Clean(joined) does not have target as a prefix, the entry is rejected. This is the standard guard against ../ archive escapes. Regular files are written via os.Create + io.Copy, which is the only extraction strategy actually implemented: the README mentions symlink/hardlink, but they are not wired in this revision of the source.
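
A prefix guard of this kind might look like the sketch below. `safeJoin` is a hypothetical helper, not a function from the source. One detail is an addition of this sketch: the comparison uses target plus a trailing separator, because a bare string prefix test would accept a sibling directory such as `/out-evil` as if it were inside `/out`.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin joins an archive entry name onto target and rejects any result
// that resolves outside target. Hypothetical helper for illustration.
func safeJoin(target, name string) (string, error) {
	root := filepath.Clean(target)
	joined := filepath.Clean(filepath.Join(root, name))

	// Build "root/" so that "/out-evil" does not pass as a child of "/out".
	prefix := root
	if !strings.HasSuffix(prefix, string(filepath.Separator)) {
		prefix += string(filepath.Separator)
	}
	if joined != root && !strings.HasPrefix(joined, prefix) {
		return "", fmt.Errorf("entry %q escapes target %q", name, target)
	}
	return joined, nil
}

func main() {
	for _, name := range []string{"docs/readme.md", "../etc/passwd", "../out-evil/x"} {
		p, err := safeJoin("/out", name)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("ok:", p)
	}
}
```

The check is purely lexical. It does not resolve symlinks that already exist inside target, which matters less here because symlink entries are not extracted in this revision.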

Key Terms

MultiWriter → io.MultiWriter(file, hasher); the same bytes are written to disk and consumed by sha256 in one pass.
Preset → the klauspost/compress encoder mode selected from a numeric level band.
Path prefix check → the runtime guard that prevents an archive entry from being written outside the extraction target.

Q&A

Q: Why is the checksum over compressed bytes instead of source bytes? A: It costs zero extra passes — the multiwriter sees compressed output already in transit. The trade-off is that the hash verifies the pellet artefact, not equivalence of two source trees.
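
Since the hash covers the artefact, verifying a pellet later is a single read of the compressed file. The sketch below assumes the checksum is stored as lowercase hex, which the source does not confirm; `verifyPellet` and `verifyBytes` are hypothetical names.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// verifyPellet re-hashes the compressed stream and compares the digest with
// the recorded one. Assumption: the recorded checksum is lowercase hex.
func verifyPellet(r io.Reader, wantHex string) error {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != wantHex {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

// verifyBytes is a convenience wrapper for in-memory data.
func verifyBytes(data []byte, wantHex string) error {
	return verifyPellet(bytes.NewReader(data), wantHex)
}

func main() {
	// SHA-256 of the empty input, used here as a known-good digest.
	const emptySum = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
	fmt.Println("empty input:", verifyBytes(nil, emptySum))
	fmt.Println("altered input:", verifyBytes([]byte("x"), emptySum))
}
```

No decompression is needed for this check, which is the practical upside of hashing the compressed bytes.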

Q: Will the same source tree at the same level always produce identical bytes? A: Only if walk order, file mtimes, and tar header padding are identical. zstd itself is deterministic at a given preset; tar is the variable.
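
To see why tar is the variable, the sketch below archives the same content twice with modification times one hour apart, once as-is and once with normalized headers. It illustrates the general technique only: the source does not say that Compress normalizes anything, and `normalize`, `buildTar`, and `sameBytes` are hypothetical names.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"time"
)

// normalize clears header fields that commonly differ between two otherwise
// identical trees. The choice of fields is illustrative.
func normalize(h *tar.Header) {
	h.ModTime = time.Unix(0, 0)
	h.AccessTime = time.Time{}
	h.ChangeTime = time.Time{}
	h.Uid, h.Gid = 0, 0
	h.Uname, h.Gname = "", ""
}

// buildTar writes a single-entry archive stamped with mtime.
func buildTar(mtime time.Time, clean bool) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	body := []byte("hello")
	hdr := &tar.Header{
		Typeflag: tar.TypeReg,
		Name:     "a.txt",
		Mode:     0o644,
		Size:     int64(len(body)),
		ModTime:  mtime,
	}
	if clean {
		normalize(hdr)
	}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write(body); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// sameBytes reports whether two archives of identical content, stamped one
// hour apart, come out byte-identical.
func sameBytes(clean bool) (bool, error) {
	a, err := buildTar(time.Unix(1700000000, 0), clean)
	if err != nil {
		return false, err
	}
	b, err := buildTar(time.Unix(1700003600, 0), clean)
	if err != nil {
		return false, err
	}
	return bytes.Equal(a, b), nil
}

func main() {
	raw, _ := sameBytes(false)
	norm, _ := sameBytes(true)
	fmt.Println("identical without normalization:", raw)
	fmt.Println("identical with normalization:", norm)
}
```

Any difference in the tar bytes changes the compressor's input, so it also changes the compressed output and the checksum computed over it.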

Q: What HTTP status does a path-traversal entry produce on extract? A: The handler surfaces it as a 500 with an error message, and the bad entry aborts the whole extraction. Entries are written into target directly rather than staged in a temporary location, so the happy path leaves no partial or temporary files behind, but files extracted before the bad entry may remain in target.

Examples

Compressing a 120 MB workspace at level 19 takes ~3 s on a modern laptop and produces ~19 MB; level 3 finishes in ~400 ms but yields ~32 MB. The histogram nestr_pellet_compress_duration_seconds records this so Web’s CompressForm can show realistic ETAs.
