CRUMB a card from devarno-cloud

Blake3 Canonical Serialisation Pipeline

grace advanced 8 min read

ELI5

Turning a YAML file into a fingerprint is a five-station assembly line. Parse, clean, sort, encode, hash. Skip a station or do it slightly wrong on one machine and the seal won’t match the seal stamped by another machine — same content, different fingerprint, FM-01.

Technical Deep Dive

Source: CANONICAL-SERIALISATION-SPEC.md (1215 lines), packages/fingerprint/src/{canonicalise,hash,verify}.ts.

Five Stages

| Stage | Input | Output | Operation |
|-------|-------|--------|-----------|
| 1 | UTF-8 YAML bytes | JS object | YAML 1.2 core schema parse, merge:false, uniqueKeys:true |
| 2 | JS object | Cleaned object | Remove fingerprint (top-level only), recursive null removal, NFC normalise |
| 3 | Cleaned object | JSON string | Recursive key sort (UTF-16 code units), compact JSON.stringify |
| 4 | JSON string | UTF-8 bytes | TextEncoder.encode, no BOM, no trailing newline |
| 5 | UTF-8 bytes | 64-hex digest | Blake3-256, prefixed blake3: |
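
The Stage 5 output shape can be illustrated without a Blake3 binding. The sketch below takes 32 raw digest bytes and renders the prefixed string. The function name `formatFingerprint` is hypothetical, and lowercase hex is an assumption, since this card only says "64-hex".

```typescript
// Hypothetical helper: renders a 32-byte digest as "blake3:" + 64 hex characters.
// Assumption: lowercase hex digits.
function formatFingerprint(digest: Uint8Array): string {
  if (digest.length !== 32) {
    throw new Error("expected 32 digest bytes, got " + digest.length);
  }
  let hex = "";
  for (let i = 0; i < digest.length; i++) {
    const byte = digest[i];
    // Pad single-digit values so every byte contributes exactly two characters.
    hex += (byte < 16 ? "0" : "") + byte.toString(16);
  }
  return "blake3:" + hex;
}
```

The length check keeps a truncated or extended digest from producing a fingerprint that merely looks valid.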

Pipeline

flowchart LR
A[YAML source<br/>UTF-8 bytes] --> B[Stage 1<br/>parse YAML 1.2 core]
B --> C[Stage 2a<br/>strip top-level fingerprint]
C --> D[Stage 2b<br/>recursive null removal]
D --> E[Stage 2c<br/>NFC unicode normalise]
E --> F[Stage 3a<br/>sortKeysDeep UTF-16]
F --> G[Stage 3b<br/>JSON.stringify no spaces]
G --> H[Stage 4<br/>TextEncoder UTF-8]
H --> I[Stage 5<br/>Blake3-256]
I --> J[blake3:64-hex]
style C fill:#fef3c7
style D fill:#fef3c7
style E fill:#fef3c7
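
Stages 3 and 4 can be sketched with built-ins only. This is a minimal illustration, not the code in canonicalise.ts: the names `Json`, `canonicalJson` and `canonicalBytes` are hypothetical, and the sketch writes object members out directly rather than running a separate sortKeysDeep pass followed by JSON.stringify. The reason is that JavaScript objects enumerate integer-like keys in numeric order regardless of insertion order, which would silently undo a UTF-16 sort for keys such as "9" and "10".

```typescript
// Illustrative JSON value type; not taken from canonicalise.ts.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Stage 3: compact JSON with object keys in UTF-16 code unit order.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value); // primitives use standard JSON escaping
  }
  if (Array.isArray(value)) {
    // Array order is significant and is preserved.
    return "[" + value.map((item) => canonicalJson(item)).join(",") + "]";
  }
  const obj: { [key: string]: Json } = value;
  const members = Object.keys(obj)
    .sort() // the default sort compares strings by UTF-16 code units
    .map((key) => JSON.stringify(key) + ":" + canonicalJson(obj[key]));
  return "{" + members.join(",") + "}";
}

// Stage 4: UTF-8 bytes. TextEncoder never emits a BOM, and no newline is appended.
function canonicalBytes(value: Json): Uint8Array {
  return new TextEncoder().encode(canonicalJson(value));
}

console.log(canonicalJson({ b: 1, a: { "10": true, "9": false } }));
// {"a":{"10":true,"9":false},"b":1}
```

Note that "10" sorts before "9" because the comparison is on code units, not numeric value.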

Stage 1 Required Config

parse(yamlString, {
  schema: 'core',
  version: '1.2',
  merge: false,
  uniqueKeys: true,
})

YAML 1.2 core schema means yes/no/on/off are strings, not booleans. Timestamp-like values are strings, not Date objects. Block scalars | preserve newlines; > folds them.

Stage 2 Critical Rules

  • Field exclusion: only the top-level fingerprint field is removed. Other metadata fields, such as modified, stay and are part of the hashed content.
  • Null removal: recursive at all nesting levels. Empty arrays [] and empty objects {} are preserved (TV-06).
  • NFC normalisation: applied to all string keys and values. Decomposed and precomposed forms of é must produce identical fingerprints (TV-13a/b).
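
The three rules above can be sketched as a single recursive pass. This is an illustration under stated assumptions, not the contents of canonicalise.ts: the names `Json`, `cleanDeep` and `stripAndClean` are hypothetical, and the card does not say whether null elements inside arrays are dropped, so the sketch assumes they are.

```typescript
// Illustrative JSON value type; not taken from canonicalise.ts.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Stage 2b and 2c: drop nulls at every depth and NFC-normalise keys and string values.
function cleanDeep(value: Json): Json {
  if (typeof value === "string") {
    return value.normalize("NFC");
  }
  if (Array.isArray(value)) {
    // Assumption: null array elements are removed as well. Empty arrays survive.
    return value.filter((item) => item !== null).map((item) => cleanDeep(item));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      if (child !== null) {
        out[key.normalize("NFC")] = cleanDeep(child);
      }
    }
    return out; // may be {}, which is preserved
  }
  return value; // numbers and booleans pass through unchanged
}

// Stage 2a: remove only the top-level fingerprint, then clean the rest.
function stripAndClean(doc: { [key: string]: Json }): Json {
  const body: { [key: string]: Json } = {};
  for (const [key, child] of Object.entries(doc)) {
    if (key !== "fingerprint") {
      body[key] = child;
    }
  }
  return cleanDeep(body);
}
```

A nested field that happens to be called fingerprint is left in place, because the exclusion is applied once, before recursion starts.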

Divergence from RFC 8785 (JCS)

| Aspect | RFC 8785 | STRATT |
|--------|----------|--------|
| Input | JSON | YAML |
| Field exclusion | none | top-level fingerprint |
| Null handling | preserved | removed |
| Unicode normalisation | none | NFC |
| Key sort, compact, escaping | aligned | aligned |

Verification

packages/fingerprint/src/verify.ts:10-54:

  1. Extract stored fingerprint.
  2. Recompute pipeline on the object minus fingerprint.
  3. Compare full strings → verified | tampered | error.
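
The three steps can be sketched as follows. This is not the code in verify.ts: `verifyFingerprint`, `Doc` and the injected `computeFingerprint` callback (standing in for the whole pipeline) are hypothetical names. Mapping a missing or malformed stored value, or a pipeline exception, to "error" is an assumption, as is the lowercase-hex pattern.

```typescript
type VerifyResult = "verified" | "tampered" | "error";
type Doc = { [key: string]: unknown };

// computeFingerprint is a stand-in for the canonical pipeline (hypothetical signature).
function verifyFingerprint(
  doc: Doc,
  computeFingerprint: (body: Doc) => string,
): VerifyResult {
  // Step 1: extract the stored fingerprint.
  const stored = doc["fingerprint"];
  // Assumption: anything that is not "blake3:" + 64 lowercase hex digits is an error.
  if (typeof stored !== "string" || !/^blake3:[0-9a-f]{64}$/.test(stored)) {
    return "error";
  }
  // Step 2: recompute over a copy of the object without the top-level fingerprint.
  const body: Doc = {};
  for (const key of Object.keys(doc)) {
    if (key !== "fingerprint") {
      body[key] = doc[key];
    }
  }
  try {
    // Step 3: compare the full strings, prefix included.
    return computeFingerprint(body) === stored ? "verified" : "tampered";
  } catch {
    return "error"; // assumption: a failing pipeline is reported, not thrown
  }
}
```

Passing the pipeline in as a callback keeps the sketch independent of any particular Blake3 binding.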

Critical Constraint

blake3-wasm is pinned to exact 2.1.5. Version 3.0.0 introduces an async API that breaks deterministic synchronous hashing call sites.

Key Terms

  • NFC → Unicode Normalisation Form C (canonical decomposition followed by canonical composition); collapses canonically equivalent sequences to a single code point sequence, and therefore to a single UTF-8 byte sequence.
  • TV-XX → One of 14 reference test vectors in Section 10 of CANONICAL-SERIALISATION-SPEC.md; conformance gate for any reimplementation.
  • Tampered → Verification result when computed digest ≠ stored digest; surfaces as FM-01.

Q&A

Q: Why YAML 1.2 core specifically? A: YAML 1.1 promotes unquoted yes/no/on/off to booleans, so the parsed value, and with it the canonical form, would depend on which YAML version the parser applies. The core schema always reads them as strings, so the same input bytes yield the same object on every machine.

Q: Why is the fingerprint field stripped before hashing? A: A digest cannot contain itself. Stripping it lets the same canonical pipeline be used to compute and to verify.

Q: What if my YAML uses merge keys (<<)? A: merge:false disables merge-key expansion, so a document that relies on << fails at Stage 1. Merge keys are forbidden because they introduce structural ambiguity that defeats canonicalisation.

Examples

Two contributors author the same role unit. One types é as U+00E9, the other as U+0065+U+0301. Without Stage 2c, their JSON strings differ in byte sequence and produce different Blake3 digests — FM-01 false positive. NFC collapses both to U+00E9 before serialisation.
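
The byte-level difference in this example can be checked directly with built-ins. The helper name `utf8Hex` is hypothetical and exists only for illustration.

```typescript
const encoder = new TextEncoder();

// Render a string's UTF-8 bytes as space-separated lowercase hex.
function utf8Hex(text: string): string {
  return Array.from(encoder.encode(text), (b) => b.toString(16).padStart(2, "0")).join(" ");
}

const precomposed = "\u00E9"; // é as a single code point
const decomposed = "e\u0301"; // e followed by a combining acute accent

console.log(precomposed === decomposed);           // false
console.log(utf8Hex(precomposed));                 // c3 a9
console.log(utf8Hex(decomposed));                  // 65 cc 81
console.log(utf8Hex(decomposed.normalize("NFC"))); // c3 a9
```

Two bytes against three is enough to change every bit of the digest, which is why normalisation has to happen before serialisation rather than after.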
