Canonical Serialisation Pipeline
stratt advanced 7 min read
ELI5
To get the same fingerprint on every machine, STRATT runs the unit through a five-step laundry: read the YAML, drop nulls, normalise text, sort keys, encode bytes. Skip any step and two identical-looking units hash differently.
Technical Deep Dive
Five-Stage Pipeline
flowchart LR
  Y["YAML 1.2<br/>core schema parse"] --> N["Strip nulls<br/>+ NFC normalise"]
  N --> S["Sort keys<br/>UTF-16 order"]
  S --> J["Compact JSON<br/>UTF-8, no BOM"]
  J --> H["Blake3 hash<br/>blake3:{64hex}"]

Stage Contracts
| Stage | Input | Output | Notes |
|---|---|---|---|
| 1. Parse | YAML 1.2 string | JS object | Core schema — no custom tags |
| 2. Clean | object | object | Removes null properties; NFC-normalises every string |
| 3. Sort | object | object | Recursive key sort by UTF-16 code unit order |
| 4. Encode | object | Uint8Array | Compact JSON, UTF-8, no trailing newline, no BOM |
| 5. Hash | bytes | string | Blake3 → blake3:{64 lowercase hex} |
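The contracts for stages 2 to 4 can be sketched in a few lines of TypeScript. This is a minimal illustration, not the `@stratt/fingerprint` source: the names `Json`, `canonicalJsonSketch`, and `canonicaliseSketch` are hypothetical, and it assumes that object keys are NFC-normalised along with values and that `null` entries inside arrays are kept (the table only mentions null properties).

```typescript
// Hypothetical JSON value type for the sketch.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Stages 2-4 up to the JSON text: drop null properties, NFC-normalise strings,
// emit object members in UTF-16 code unit order, with no whitespace.
function canonicalJsonSketch(value: Json): string {
  if (typeof value === "string") return JSON.stringify(value.normalize("NFC"));
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) {
    // Assumption: nulls inside arrays are preserved.
    return "[" + value.map(canonicalJsonSketch).join(",") + "]";
  }
  const members = Object.entries(value)
    .filter(([, child]) => child !== null) // stage 2: remove null properties
    .map(([key, child]): [string, Json] => [key.normalize("NFC"), child])
    // stage 3: the < operator compares strings by UTF-16 code units
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([key, child]) => JSON.stringify(key) + ":" + canonicalJsonSketch(child));
  return "{" + members.join(",") + "}";
}

// Stage 4: UTF-8 bytes, no BOM, no trailing newline.
function canonicaliseSketch(value: Json): Uint8Array {
  return new TextEncoder().encode(canonicalJsonSketch(value));
}

console.log(canonicalJsonSketch({ type: "prompt", slug: null, domain: "core" }));
// {"domain":"core","type":"prompt"}
```

The sketch builds the JSON text directly instead of inserting sorted keys into a new object, because JavaScript objects always enumerate integer-like keys such as `"2"` and `"10"` in numeric order, which would silently undo a UTF-16 sort.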
API Surface (@stratt/fingerprint)
- canonicalise(obj) → Uint8Array: stages 2–4
- canonicalJson(obj) → string: stages 2–4 minus encoding
- fingerprintYaml(yaml) → Fingerprint: full pipeline 1–5
- fingerprintBytes(bytes) → Fingerprint: stage 5 only
Test Vectors
Fourteen vectors TV-01..TV-14 in canonical-serialisation-v1.md cover minimal units, empty arrays, YAML block scalars, Unicode NFC pairs, and reverse-key inputs. A new implementation passes the spec only when all 14 round-trip.
Key Terms
- NFC → Unicode Normalisation Form C (canonical composition); applied per-string at stage 2.
- SPEC_VERSION → constant "1" exported from packages/fingerprint/src/types.ts; bumped only on a breaking pipeline change.
- Fingerprint → { full: string, prefix: bigint, algorithm: string } returned by stage 5.
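To see why the NFC step matters, consider two strings that render identically but differ in code points. The snippet below uses only the standard `String.prototype.normalize`; the sample strings are illustrative and are not taken from the test vectors.

```typescript
const composed = "caf\u00E9";    // "é" as a single code point U+00E9
const decomposed = "cafe\u0301"; // "e" followed by combining acute U+0301

// Without normalisation the strings differ, so their bytes and hashes would too.
console.log(composed === decomposed); // false

// After NFC both collapse to the composed form.
const left = composed.normalize("NFC");
const right = decomposed.normalize("NFC");
console.log(left === right); // true
console.log(right.length);   // 4
```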
Q&A
Q: Why UTF-16 code unit order rather than codepoint order?
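A: It matches JavaScript's default string comparison (the < operator and Array.prototype.sort() with no comparator), keeping the reference implementation trivial. String.prototype.localeCompare is locale-sensitive and does not produce this order. Ports must replicate UTF-16 ordering exactly, even outside JS.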
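The two orderings only disagree when a character outside the Basic Multilingual Plane meets one in the range U+E000 to U+FFFF. A small sketch of the difference, with hypothetical helper names `compareUtf16` and `compareCodePoints`:

```typescript
// UTF-16 code unit order: what JavaScript's < operator does on strings.
function compareUtf16(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Code point order, shown only for contrast.
function compareCodePoints(a: string, b: string): number {
  const pa = Array.from(a, (ch) => ch.codePointAt(0)!);
  const pb = Array.from(b, (ch) => ch.codePointAt(0)!);
  const shared = Math.min(pa.length, pb.length);
  for (let i = 0; i < shared; i++) {
    if (pa[i] !== pb[i]) return pa[i] < pb[i] ? -1 : 1;
  }
  return Math.sign(pa.length - pb.length);
}

const emoji = "\u{1F600}"; // U+1F600, stored as the surrogate pair D83D DE00
const tilde = "\uFF5E";    // U+FF5E, a single code unit

console.log(compareUtf16(emoji, tilde));      // -1, because 0xD83D < 0xFF5E
console.log(compareCodePoints(emoji, tilde)); // 1, because 0x1F600 > 0xFF5E
```

In languages whose strings are UTF-8, a plain bytewise comparison yields code point order, so a port needs an explicit UTF-16 comparison to agree with the reference implementation on keys like these.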
Q: Why hash compact JSON instead of the original YAML?
A: YAML has many equivalent surface forms (block vs flow, quoted vs bare). Funnelling to compact JSON eliminates that ambiguity before bytes reach Blake3.
Examples
A unit with keys [type, domain, slug] and another with [domain, slug, type] produce identical fingerprints because stage 3 sorts both into the same order before stage 4 encodes them.
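The key-order example can be reproduced with a short sketch. The helper `sortedJson` and the field values are hypothetical, and it handles only flat string records; it is meant to show the effect of stage 3, not the full pipeline.

```typescript
// Hypothetical helper: serialise a flat record with keys in UTF-16 order.
function sortedJson(unit: Record<string, string>): string {
  const members = Object.keys(unit)
    .sort() // default sort compares strings by UTF-16 code units
    .map((key) => JSON.stringify(key) + ":" + JSON.stringify(unit[key]));
  return "{" + members.join(",") + "}";
}

const unitA = { type: "prompt", domain: "core", slug: "hello" };
const unitB = { domain: "core", slug: "hello", type: "prompt" };

console.log(sortedJson(unitA));
// {"domain":"core","slug":"hello","type":"prompt"}
console.log(sortedJson(unitA) === sortedJson(unitB));         // true
console.log(JSON.stringify(unitA) === JSON.stringify(unitB)); // false
```

Because the sorted text is identical, the bytes handed to stage 5 are identical, and so is the resulting fingerprint.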
neighbors on the map
- Packet Encoding & Compression debugging encoding errors
- Unit Schema Types authoring a new prompt unit