CRUMB a card from devarno-cloud

Canonical Serialisation Pipeline

stratt advanced 7 min read

ELI5

To get the same fingerprint on every machine, STRATT runs the unit through a five-step laundry: parse the YAML, strip nulls and normalise text, sort keys, encode to bytes, hash. Skip any step before the hash and two identical-looking units hash differently.

Technical Deep Dive

Five-Stage Pipeline

flowchart LR
Y["YAML 1.2<br/>core schema parse"] --> N["Strip nulls<br/>+ NFC normalise"]
N --> S["Sort keys<br/>UTF-16 order"]
S --> J["Compact JSON<br/>UTF-8, no BOM"]
J --> H["Blake3 hash<br/>blake3:{64hex}"]

Stage Contracts

| Stage | Input | Output | Notes |
|---|---|---|---|
| 1. Parse | YAML 1.2 string | JS object | Core schema; no custom tags |
| 2. Clean | object | object | Removes null properties; NFC-normalises every string |
| 3. Sort | object | object | Recursive key sort by UTF-16 code unit order |
| 4. Encode | object | Uint8Array | Compact JSON, UTF-8, no trailing newline, no BOM |
| 5. Hash | bytes | string | Blake3 → blake3:{64 lowercase hex} |
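
Stage 2 is where most porting mistakes tend to hide, so a minimal sketch may help. The helper name `clean` and the `JsonValue` type are hypothetical and not part of @stratt/fingerprint. Two behaviours here are assumptions, not confirmed by the contract above: nulls inside arrays are kept, and object keys are NFC-normalised along with values.

```typescript
// Hypothetical stage-2 helper; names are illustrative, not the package's API.
type JsonValue =
  | null
  | boolean
  | number
  | string
  | JsonValue[]
  | { [key: string]: JsonValue };

function clean(value: JsonValue): JsonValue {
  if (typeof value === "string") {
    // NFC composes a base letter plus combining mark into one code point where one exists.
    return value.normalize("NFC");
  }
  if (Array.isArray(value)) {
    // Assumption: array elements stay in place, including any nulls.
    return value.map(clean);
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: JsonValue } = {};
    for (const key of Object.keys(value)) {
      const child = value[key] as JsonValue;
      if (child === null) continue; // drop null properties
      // Assumption: keys are normalised as well as values.
      out[key.normalize("NFC")] = clean(child);
    }
    return out;
  }
  return value; // numbers, booleans, and a top-level null pass through
}

// "Cafe" + U+0301 (combining acute) becomes "Caf" + U+00E9; the null property disappears.
const cleaned = clean({ title: "Cafe\u0301", notes: null, tags: [] });
const cleanedJson = JSON.stringify(cleaned);
```

The empty array survives while the null property does not, which is the distinction the empty-array test vectors appear to target.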

API Surface (@stratt/fingerprint)

  • canonicalise(obj) → Uint8Array — stages 2–4
  • canonicalJson(obj) → string — stages 2–4 minus encoding
  • fingerprintYaml(yaml) → Fingerprint — full pipeline 1–5
  • fingerprintBytes(bytes) → Fingerprint — stage 5 only

Test Vectors

Fourteen vectors TV-01..TV-14 in canonical-serialisation-v1.md cover minimal units, empty arrays, YAML block scalars, Unicode NFC pairs, and reverse-key inputs. A new implementation passes the spec only when all 14 round-trip.

Key Terms

  • NFC → Unicode Normalisation Form C (canonical composition); applied per-string at stage 2.
  • SPEC_VERSION → constant "1" exported from packages/fingerprint/src/types.ts; bumped only on a breaking pipeline change.
  • Fingerprint → object of shape { full: string, prefix: bigint, algorithm: string }, returned by stage 5.
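
The `blake3:{64 lowercase hex}` string format from stage 5 can be checked and produced with a few lines. This is a sketch only: `FINGERPRINT_PATTERN`, `isFingerprintString`, and `formatDigest` are hypothetical names, not exports of @stratt/fingerprint, and the digest used below is placeholder bytes, not a real Blake3 hash.

```typescript
// Hypothetical helpers; names are illustrative, not the package's API.
const FINGERPRINT_PATTERN = /^blake3:[0-9a-f]{64}$/;

// True only for "blake3:" followed by exactly 64 lowercase hex digits.
function isFingerprintString(value: string): boolean {
  return FINGERPRINT_PATTERN.test(value);
}

// Formats a raw 32-byte digest (64 hex digits) in the stage-5 string form.
function formatDigest(digest: Uint8Array): string {
  if (digest.length !== 32) {
    throw new Error("expected a 32-byte digest, got " + digest.length);
  }
  const hex = Array.from(digest, (b) => (b < 16 ? "0" : "") + b.toString(16));
  return "blake3:" + hex.join("");
}

// Placeholder digest of 32 zero bytes, used only to show the formatting.
const zeroDigest = formatDigest(new Uint8Array(32));
```

Rejecting uppercase hex at the boundary keeps string comparison of fingerprints equivalent to byte comparison of digests.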

Q&A

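Q: Why UTF-16 code unit order rather than codepoint order? A: It matches JavaScript's built-in string comparison (the < operator, and Array.prototype.sort called without a comparator), keeping the reference implementation trivial. It is not what String.prototype.localeCompare does, since that method is locale-sensitive. Ports must replicate UTF-16 ordering exactly even outside JS.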
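
The two orderings only disagree when a key containing a character outside the Basic Multilingual Plane is compared against one in the range U+E000..U+FFFF. The sketch below illustrates that case; the function names are hypothetical and the code point comparator exists only for contrast.

```typescript
// UTF-16 code unit order: what JavaScript's < and > operators use on strings.
function compareCodeUnits(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Splits a string into Unicode code points, joining surrogate pairs.
function toCodePoints(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; ) {
    const cp = s.codePointAt(i) as number;
    out.push(cp);
    i += cp > 0xffff ? 2 : 1;
  }
  return out;
}

// Code point order, shown only for contrast with the spec's ordering.
function compareCodePoints(a: string, b: string): number {
  const ca = toCodePoints(a);
  const cb = toCodePoints(b);
  const n = Math.min(ca.length, cb.length);
  for (let i = 0; i < n; i++) {
    const x = ca[i] as number;
    const y = cb[i] as number;
    if (x !== y) return x - y;
  }
  return ca.length - cb.length;
}

const fullwidthTilde = "\uFF5E"; // U+FF5E, one code unit: FF5E
const grinningFace = "\uD83D\uDE00"; // U+1F600, surrogate pair: D83D DE00

// Code units: D83D < FF5E, so the emoji sorts first.
const byCodeUnit = [fullwidthTilde, grinningFace].sort(compareCodeUnits);
// Code points: FF5E < 1F600, so the tilde sorts first.
const byCodePoint = [grinningFace, fullwidthTilde].sort(compareCodePoints);
```

A port written in a language whose strings compare by code point or by UTF-8 bytes would need an explicit UTF-16 comparator to match the first result.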

Q: Why hash compact JSON instead of the original YAML? A: YAML has many equivalent surface forms (block vs flow, quoted vs bare). Funnelling to compact JSON eliminates that ambiguity before bytes reach Blake3.

Examples

A unit with keys [type, domain, slug] and another with [domain, slug, type] produce identical fingerprints because stage 3 sorts both into the same order before stage 4 encodes them.
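
The key-order example can be reproduced with a small stand-in for stages 3 and 4. This is a sketch under stated assumptions: `canonicalJsonSketch` is a hypothetical name, not the package's `canonicalJson`, it skips stage 2 entirely, and the field values are invented for illustration.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Stand-in for stages 3 and 4 (minus byte encoding): emit compact JSON with
// object keys in UTF-16 code unit order at every nesting level.
function canonicalJsonSketch(value: Json): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalJsonSketch).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    // Array.prototype.sort with no comparator orders strings by UTF-16 code units.
    const keys = Object.keys(value).sort();
    const members = keys.map(
      (k) => JSON.stringify(k) + ":" + canonicalJsonSketch(value[k] as Json)
    );
    return "{" + members.join(",") + "}";
  }
  return JSON.stringify(value); // primitives use standard JSON escaping
}

// Hypothetical unit contents; only the key order differs between the two.
const unitA = { type: "note", domain: "docs", slug: "intro" };
const unitB = { domain: "docs", slug: "intro", type: "note" };

const jsonA = canonicalJsonSketch(unitA);
const jsonB = canonicalJsonSketch(unitB);

// UTF-8 encoding; TextEncoder emits no BOM and adds no trailing newline.
const bytesA = new TextEncoder().encode(jsonA);
```

Both inputs serialise to `{"domain":"docs","slug":"intro","type":"note"}`, so any hash over the resulting bytes is necessarily the same for both. Writing the keys out in sorted order, rather than rebuilding a sorted object, sidesteps JavaScript's rule that integer-like keys are always enumerated first.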
