Zstd Compression Pipeline
nestr intermediate 5 min read
ELI5
Compress is like packing groceries into a vacuum bag while the cashier prints a receipt at the same time: one stream of bytes goes into the bag (.pellet) and into the receipt-tape (sha256) simultaneously, then a label gets stuck on the bag.
Technical Deep Dive
Store.Compress(sourcePath, level) (engine/internal/pellets/pellets.go:62) builds a single byte stream:
walk(sourcePath) → tar.Writer → zstd.Encoder → io.MultiWriter[file, sha256]
Execution Sequence Diagram
sequenceDiagram
    participant C as Caller
    participant S as Store.Compress
    participant T as tar.Writer
    participant Z as zstd.Encoder
    participant F as os.File(.pellet)
    participant H as sha256.New
    C->>S: Compress(path, level)
    S->>S: filepath.Walk → totalSize
    S->>Z: NewWriter(level→preset)
    S->>T: NewWriter(zstd)
    loop for each file in walk
        T->>Z: tar header + bytes
        Z->>F: encoded bytes
        Z->>H: same encoded bytes
    end
    T->>Z: Close()
    Z->>F: flush
    S->>S: build Pellet{ID, ratio, checksum}
    S->>F: write .meta.json sidecar
    S-->>C: *Pellet
Level → Preset Mapping
| Level range | klauspost preset |
|---|---|
| 1–4 | SpeedFastest |
| 5–9 | SpeedDefault |
| 10–22 | SpeedBestCompression |
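The banding in the table can be sketched as a small lookup function. This is a minimal illustration, not the code in pellets.go: `presetForLevel` and the string-valued `preset` type are hypothetical stand-ins (the real values would be the `zstd.SpeedFastest`, `zstd.SpeedDefault`, and `zstd.SpeedBestCompression` constants from klauspost/compress), and rejecting levels outside 1–22 is an assumption about out-of-range handling.

```go
package main

import "fmt"

// preset is a hypothetical stand-in for the klauspost/compress encoder level type.
type preset string

const (
	speedFastest         preset = "SpeedFastest"
	speedDefault         preset = "SpeedDefault"
	speedBestCompression preset = "SpeedBestCompression"
)

// presetForLevel maps a numeric level onto the band shown in the table.
// Assumption: levels outside 1-22 are rejected rather than clamped.
func presetForLevel(level int) (preset, error) {
	switch {
	case level >= 1 && level <= 4:
		return speedFastest, nil
	case level >= 5 && level <= 9:
		return speedDefault, nil
	case level >= 10 && level <= 22:
		return speedBestCompression, nil
	default:
		return "", fmt.Errorf("level %d is outside 1-22", level)
	}
}

func main() {
	for _, level := range []int{3, 9, 19} {
		p, err := presetForLevel(level)
		if err != nil {
			panic(err)
		}
		fmt.Printf("level %d -> %s\n", level, p)
	}
}
```

Because each band collapses to one preset, levels 10 and 22 would select the same encoder mode under this mapping.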
Extraction Side
Store.Extract(id, target) reads .pellet through zstd.Decoder → tar.Reader. Each header is checked: if filepath.Clean(joined) does not have target as a prefix, the entry is rejected — the standard guard against ../ archive escapes. Regular files are written via os.Create + io.Copy (the only extraction strategy actually implemented; the README mentions symlink/hardlink but they are not wired in this revision of the source).
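The prefix guard described above can be sketched as follows. `safeJoin` is a hypothetical helper name, not one taken from the source. The sketch also compares against the target plus a path separator, which is a slightly stricter variant than a bare prefix test: a bare prefix test would accept `/tmp/out-evil` as a child of `/tmp/out`. Whether the actual implementation does the same is not confirmed by the source.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin joins a tar entry name onto target and rejects any result that
// lands outside target. Hypothetical helper for illustration only.
func safeJoin(target, name string) (string, error) {
	cleanTarget := filepath.Clean(target)
	joined := filepath.Join(cleanTarget, name) // Join cleans the result, resolving ".."
	// Require target + separator as the prefix so that a sibling directory
	// such as "/tmp/out-evil" is not mistaken for a child of "/tmp/out".
	inside := joined == cleanTarget ||
		strings.HasPrefix(joined, cleanTarget+string(filepath.Separator))
	if !inside {
		return "", fmt.Errorf("entry %q escapes %q", name, target)
	}
	return joined, nil
}

func main() {
	names := []string{"src/main.go", "../etc/passwd", "a/../../b", "../out-evil/x"}
	for _, name := range names {
		path, err := safeJoin("/tmp/out", name)
		fmt.Printf("%-16s path=%q err=%v\n", name, path, err)
	}
}
```

Only the first entry is accepted. The other three resolve to paths outside `/tmp/out` and return an error, which a caller would treat as a reason to abort extraction.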
Key Terms
- MultiWriter → io.MultiWriter(file, hasher); the same bytes are written to disk and consumed by sha256 in one pass.
- Preset → the klauspost/compress encoder mode selected from a numeric level band.
- Path prefix check → the runtime guard that prevents tar.Reader from writing outside the extraction target.
Q&A
Q: Why is the checksum over compressed bytes instead of source bytes?
A: It costs zero extra passes: the multiwriter sees compressed output already in transit. The trade-off is that the hash verifies the pellet artefact, not equivalence of two source trees.
Q: Will the same source tree at the same level always produce identical bytes?
A: Only if walk order, file mtimes, and tar header padding are identical. zstd itself is deterministic at a given preset; tar is the variable.
Q: What HTTP status does a path-traversal entry produce on extract?
A: The handler surfaces it as a 500 with an error message, and the bad entry aborts the whole extraction. On the happy path no temporary or partial files are left behind, because entries are written into target directly.
Examples
Compressing a 120 MB workspace at level 19 takes ~3 s on a modern laptop and produces ~19 MB; level 3 finishes in ~400 ms but yields ~32 MB. The histogram nestr_pellet_compress_duration_seconds records this so Web’s CompressForm can show realistic ETAs.
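The trade-off in those figures is easier to compare as a ratio and a throughput. The sketch below assumes the ratio is defined as original size divided by compressed size, which may not match how the Pellet ratio field is computed. The helper names are hypothetical, and the inputs are the approximate numbers quoted above.

```go
package main

import "fmt"

// ratio returns original size divided by compressed size.
// Assumption: the source does not say which way its ratio field is defined.
func ratio(originalMB, compressedMB float64) float64 {
	return originalMB / compressedMB
}

// throughput returns megabytes of source data processed per second.
func throughput(originalMB, seconds float64) float64 {
	return originalMB / seconds
}

func main() {
	fmt.Printf("level 19: ratio %.2fx at %.0f MB/s\n", ratio(120, 19), throughput(120, 3))
	fmt.Printf("level 3:  ratio %.2fx at %.0f MB/s\n", ratio(120, 32), throughput(120, 0.4))
}
```

On these numbers, level 19 gives roughly a 6.3x reduction against 3.75x for level 3, while level 3 moves source bytes several times faster.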