CRUMB a card from devarno-cloud

Zstd Compression Pipeline

nestr intermediate 5 min read

ELI5

Compress is like packing groceries into a vacuum bag while the cashier prints a receipt at the same time: one stream of bytes goes into the bag (.pellet) and into the receipt-tape (sha256) simultaneously, then a label gets stuck on the bag.

Technical Deep Dive

Store.Compress(sourcePath, level) (engine/internal/pellets/pellets.go:62) builds a single byte stream:

walk(sourcePath) → tar.Writer → zstd.Encoder → io.MultiWriter[file, sha256]
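
The single-pass shape of that pipeline can be sketched with the standard library alone. This is a minimal sketch, not the code in pellets.go: compress/gzip stands in for the zstd encoder so the example has no third-party imports, the totalSize pre-walk is omitted, and compressTree and runDemo are hypothetical names.

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// compressTree streams sourcePath as tar -> compressor -> (file, sha256).
// Assumption: gzip stands in for zstd; only regular files are archived.
func compressTree(sourcePath, outPath string) (string, error) {
	out, err := os.Create(outPath)
	if err != nil {
		return "", err
	}
	defer out.Close()

	hasher := sha256.New()
	sink := io.MultiWriter(out, hasher) // each compressed byte reaches both
	zw := gzip.NewWriter(sink)
	tw := tar.NewWriter(zw)

	walkErr := filepath.WalkDir(sourcePath, func(path string, d fs.DirEntry, err error) error {
		// Propagate walk errors; skip directories and other non-regular entries.
		if err != nil || !d.Type().IsRegular() {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(sourcePath, path)
		if err != nil {
			return err
		}
		hdr.Name = filepath.ToSlash(rel)
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
	if walkErr != nil {
		return "", walkErr
	}
	// Close inner-to-outer so each layer flushes into the next.
	if err := tw.Close(); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", hasher.Sum(nil)), nil
}

// runDemo compresses a one-file tree and re-hashes the output from disk.
func runDemo() error {
	src, err := os.MkdirTemp("", "pellet-src")
	if err != nil {
		return err
	}
	defer os.RemoveAll(src)
	if err := os.WriteFile(filepath.Join(src, "a.txt"), []byte("hello"), 0o644); err != nil {
		return err
	}
	outPath := src + ".pellet" // sibling of the source dir, so it is not walked
	defer os.Remove(outPath)
	sum, err := compressTree(src, outPath)
	if err != nil {
		return err
	}
	raw, err := os.ReadFile(outPath)
	if err != nil {
		return err
	}
	if fmt.Sprintf("%x", sha256.Sum256(raw)) != sum {
		return fmt.Errorf("streamed checksum does not match file on disk")
	}
	fmt.Println("sha256:", sum)
	return nil
}

func main() {
	if err := runDemo(); err != nil {
		fmt.Println("demo failed:", err)
	}
}
```

The order of the two Close calls matters: closing the tar writer first writes the archive trailer into the compressor, and closing the compressor then flushes the final frame into both the file and the hasher.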

Execution Sequence Diagram

sequenceDiagram
participant C as Caller
participant S as Store.Compress
participant T as tar.Writer
participant Z as zstd.Encoder
participant F as os.File(.pellet)
participant H as sha256.New
C->>S: Compress(path, level)
S->>S: filepath.Walk → totalSize
S->>Z: NewWriter(level→preset)
S->>T: NewWriter(zstd)
loop for each file in walk
T->>Z: tar header + bytes
Z->>F: encoded bytes
Z->>H: same encoded bytes
end
T->>Z: Close()
Z->>F: flush
S->>S: build Pellet{ID, ratio, checksum}
S->>F: write .meta.json sidecar
S-->>C: *Pellet

Level → Preset Mapping

Level range    klauspost preset
1–4            SpeedFastest
5–9            SpeedDefault
10–22          SpeedBestCompression
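
The banding above can be written as a small lookup. This is a sketch under assumptions: presetForLevel is a hypothetical helper, it returns the preset name as a string instead of a klauspost/compress value so that it stays dependency-free, and treating levels outside 1–22 as an error is a guess, since the source does not say how they are handled.

```go
package main

import "fmt"

// presetForLevel maps a numeric level to the preset name for its band.
// Assumption: levels outside 1-22 are rejected; the real code may clamp instead.
func presetForLevel(level int) (string, error) {
	switch {
	case level >= 1 && level <= 4:
		return "SpeedFastest", nil
	case level >= 5 && level <= 9:
		return "SpeedDefault", nil
	case level >= 10 && level <= 22:
		return "SpeedBestCompression", nil
	default:
		return "", fmt.Errorf("level %d is outside 1-22", level)
	}
}

func main() {
	for _, level := range []int{3, 9, 19} {
		preset, err := presetForLevel(level)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("level %d -> %s\n", level, preset)
	}
}
```

One consequence of the banding is that any two levels in the same band, such as 10 and 22, select the same encoder mode.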

Extraction Side

Store.Extract(id, target) reads .pellet through zstd.Decoder → tar.Reader. Each header is checked: the entry name is joined onto target, and if filepath.Clean(joined) does not have target as a prefix, the entry is rejected. This is the standard guard against ../ archive escapes. Regular files are written via os.Create + io.Copy, the only extraction strategy actually implemented; the README mentions symlink/hardlink, but they are not wired in this revision of the source.
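
A prefix guard of this kind might look like the following. It is a sketch, not the check in the source: safeJoin is a hypothetical name, and comparing against target plus a trailing separator is an addition here so that a sibling directory such as out-evil does not pass as a prefix match for out.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin joins an archive entry name onto target and rejects results
// that land outside target. filepath.Join already cleans its result.
func safeJoin(target, name string) (string, error) {
	root := filepath.Clean(target)
	joined := filepath.Join(root, name)
	inside := joined == root ||
		strings.HasPrefix(joined, root+string(filepath.Separator))
	if !inside {
		return "", fmt.Errorf("entry %q escapes %q", name, target)
	}
	return joined, nil
}

func main() {
	for _, name := range []string{"docs/a.txt", "../evil", "../out-evil/x"} {
		path, err := safeJoin("out", name)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("ok:", path)
	}
}
```

A bare strings.HasPrefix(joined, root) would accept the third entry, because the string out-evil/x begins with out.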

Key Terms

  • MultiWriter → io.MultiWriter(file, hasher); the same bytes are written to disk and consumed by sha256 in one pass.
  • Preset → the klauspost/compress encoder mode selected from a numeric level band.
  • Path prefix check → the runtime guard that prevents tar.Reader from writing outside the extraction target.

Q&A

Q: Why is the checksum over compressed bytes instead of source bytes? A: It costs zero extra passes — the multiwriter sees compressed output already in transit. The trade-off is that the hash verifies the pellet artefact, not equivalence of two source trees.
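
Because the hash covers the compressed artefact, a pellet can be checked later without decompressing it. The sketch below assumes the checksum is stored as a lowercase hex string; verifyPellet and runDemo are hypothetical names, and the demo file is a stand-in, not a real pellet.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// verifyPellet re-hashes the file at path and compares it with wantHex.
// Assumption: the recorded checksum is lowercase hex of the raw file bytes.
func verifyPellet(path, wantHex string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != wantHex {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

// runDemo writes a stand-in file, then checks a correct and a wrong checksum.
func runDemo() error {
	dir, err := os.MkdirTemp("", "pellet-verify")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "demo.pellet")
	data := []byte("stand-in for compressed bytes")
	if err := os.WriteFile(path, data, 0o600); err != nil {
		return err
	}
	sum := sha256.Sum256(data)
	if err := verifyPellet(path, hex.EncodeToString(sum[:])); err != nil {
		return err
	}
	if err := verifyPellet(path, "00"); err == nil {
		return fmt.Errorf("expected a mismatch for a wrong checksum")
	}
	return nil
}

func main() {
	if err := runDemo(); err != nil {
		fmt.Println("demo failed:", err)
		os.Exit(1)
	}
	fmt.Println("checksum verified")
}
```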

Q: Will the same source tree at the same level always produce identical bytes? A: Only if walk order, file mtimes, and tar header padding are identical. zstd itself is deterministic at a given preset; tar is the variable.

Q: What HTTP status does a path-traversal entry produce on extract? A: The handler surfaces it as a 500 with an error message, and the bad entry aborts the whole extraction. Files are written into target directly, so the happy path leaves no partial or temporary files behind.

Examples

Compressing a 120 MB workspace at level 19 takes ~3 s on a modern laptop and produces ~19 MB; level 3 finishes in ~400 ms but yields ~32 MB. The histogram nestr_pellet_compress_duration_seconds records this so Web’s CompressForm can show realistic ETAs.
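
As a quick check on those figures, the ratios work out as follows. The sketch assumes ratio means source size divided by compressed size; the Pellet's ratio field may be defined the other way round, and compressionRatio is a hypothetical helper.

```go
package main

import "fmt"

// compressionRatio returns source size divided by compressed size.
// Assumption: this is only one possible definition of "ratio".
func compressionRatio(sourceMB, compressedMB float64) float64 {
	return sourceMB / compressedMB
}

func main() {
	fmt.Printf("level 19: %.2fx\n", compressionRatio(120, 19)) // level 19: 6.32x
	fmt.Printf("level 3:  %.2fx\n", compressionRatio(120, 32)) // level 3:  3.75x
}
```

On these numbers, level 19 takes roughly 7.5 times as long as level 3 (about 3 s against about 400 ms) to save a further 13 MB.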
