Blake3 5-Stage Canonical Fingerprinting (DEC-005)


ELI5

Imagine you want to give every person a unique ID number based on everything about them — but you don’t include their ID card number or the date they got it, because those change. You write down their name, skills, and job description in the exact same order every time, convert it to a special code, and run it through a super-fast math machine called Blake3. The result is a 64-character fingerprint that proves “this is exactly this person.”

Technical Deep Dive

The 5-Stage Canonical Pipeline (DEC-005)

flowchart LR
    A["Stage 1<br/>YAML/JSON Parse"] --> B["Stage 2<br/>Object Transform"]
    B --> C["Stage 3<br/>Canonical JSON"]
    C --> D["Stage 4<br/>UTF-8 Encode"]
    D --> E["Stage 5<br/>Blake3 Hash"]
    E --> F["64-char hex digest"]

Stage 1: YAML/JSON Parse

Input is parsed into a native Python object (dict/list/scalar) using strict YAML 1.2 or JSON parsing. This happens externally before FingerprintEngine.compute_hash() is called.
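The source does not spell out what "strict" parsing means. As a minimal sketch for the JSON case only, one plausible reading is to reject duplicate keys and the non-standard constants NaN and Infinity, both of which Python's json module accepts by default. The names parse_strict_json, _reject_duplicate_keys and _reject_constant are hypothetical and not part of FingerprintEngine.

```python
import json


def _reject_duplicate_keys(pairs):
    """Build a dict from (key, value) pairs, refusing repeated keys."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


def _reject_constant(name):
    """Called by json.loads for NaN, Infinity and -Infinity."""
    raise ValueError(f"non-standard JSON constant: {name}")


def parse_strict_json(text):
    """Stage 1 sketch: turn JSON text into a native Python object."""
    return json.loads(
        text,
        object_pairs_hook=_reject_duplicate_keys,
        parse_constant=_reject_constant,
    )


sprite = parse_strict_json('{"name": "scout", "capabilities": ["search"]}')
```

Rejecting duplicate keys matters for fingerprinting because parsers differ on which duplicate wins, so the same text could otherwise hash differently across systems.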

Stage 2: Object Transform

Recursively normalizes the data structure for deterministic hashing:

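# Fields left out of the hash (see "Stripped fields" below)
_EXCLUDED_KEYS = ("id", "fingerprint", "$schema", "created")

def _normalize_object(obj):
    """Recursively rebuild obj with sorted keys and excluded fields removed."""
    if isinstance(obj, dict):
        # Sort keys alphabetically; drop excluded fields at every nesting level
        return {
            k: _normalize_object(v)
            for k, v in sorted(obj.items())
            if k not in _EXCLUDED_KEYS
        }
    elif isinstance(obj, list):
        # List order is preserved; only the items are normalized
        return [_normalize_object(item) for item in obj]
    else:
        return obj  # Scalars pass through unchanged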

Stripped fields:

id — server-generated, non-deterministic
fingerprint — circular (hash of a hash)
$schema — metadata, not semantic content
created — timestamp, changes on every creation
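To illustrate the effect of stripping, the sketch below normalizes two sprites that differ only in the fields listed above and checks that the results are equal. It assumes all four listed fields are excluded; the names EXCLUDED_KEYS and normalize, and the sample sprite values, are made up for this example.

```python
# Assumption: the four fields listed above are all stripped.
EXCLUDED_KEYS = frozenset({"id", "fingerprint", "$schema", "created"})


def normalize(obj):
    """Stage 2 sketch: sort dict keys and drop excluded fields, recursively."""
    if isinstance(obj, dict):
        return {
            key: normalize(value)
            for key, value in sorted(obj.items())
            if key not in EXCLUDED_KEYS
        }
    if isinstance(obj, list):
        return [normalize(item) for item in obj]
    return obj


first = {
    "id": "spr_001",
    "created": "2024-01-01T00:00:00Z",
    "name": "scout",
    "capabilities": ["search", {"id": "x", "tool": "web"}],
}
second = {
    "name": "scout",
    "capabilities": ["search", {"tool": "web", "id": "y"}],
    "created": "2025-06-30T12:00:00Z",
    "id": "spr_999",
}

same = normalize(first) == normalize(second)  # True
```

Because the exclusion is applied inside the recursion, a key named id nested anywhere in the structure is dropped too, not only the top-level one. List order is left untouched, so reordering capabilities would produce a different normalized object.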

Stage 3: Canonical JSON Serialization

import json

canonical = json.dumps(
    normalized,
    separators=(",", ":"),     # No whitespace between tokens
    sort_keys=True,             # Deterministic key order
    ensure_ascii=True           # Non-ASCII characters escaped as \uXXXX
)

This produces a compact, fully deterministic JSON string with no formatting variation.
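A short sketch of what those three options do in practice: two dicts with the same content in different key order serialize to the same string, and a non-ASCII character is emitted as an escape sequence. The helper name canonical_json and the sample values are illustrative only.

```python
import json


def canonical_json(normalized):
    """Stage 3 sketch: compact, key-sorted, ASCII-only JSON text."""
    return json.dumps(
        normalized,
        separators=(",", ":"),
        sort_keys=True,
        ensure_ascii=True,
    )


a = canonical_json({"name": "café", "version": 2})
b = canonical_json({"version": 2, "name": "café"})

print(a)       # {"name":"caf\u00e9","version":2}
print(a == b)  # True
```

One caveat worth keeping in mind for cross-system use: Python serializes the float 1.0 as 1.0 and the integer 1 as 1, so numeric types must survive parsing unchanged for two implementations to agree.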

Stage 4: UTF-8 Encoding

encoded = canonical.encode("utf-8")  # No BOM, no trailing newline

Stage 5: Blake3 Hash

import blake3
digest = blake3.blake3(encoded).hexdigest()  # 64-char lowercase hex

SHA-256 fallback: If the blake3 package is not installed, the engine falls back to hashlib.sha256(encoded).hexdigest(), which also returns a 64-character lowercase hex digest.
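One way the fallback could be wired up is an import guard, sketched below. The function name hash_bytes and the flag _HAVE_BLAKE3 are hypothetical; the algorithm prefix follows the stored format described under Fingerprint Format. This is not necessarily how FingerprintEngine selects the algorithm.

```python
import hashlib

try:
    import blake3  # third-party package, may be absent
    _HAVE_BLAKE3 = True
except ImportError:
    _HAVE_BLAKE3 = False


def hash_bytes(encoded: bytes) -> str:
    """Stage 5 sketch: hash the canonical bytes and label the algorithm used."""
    if _HAVE_BLAKE3:
        return "blake3:" + blake3.blake3(encoded).hexdigest()
    return "sha256:" + hashlib.sha256(encoded).hexdigest()


fingerprint = hash_bytes(b'{"name":"scout"}')
```

Labelling the result matters because the two algorithms yield different digests for the same bytes, so a fingerprint computed with the fallback can never match one computed with Blake3.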

Performance Comparison

Algorithm	Speed (single-core)	Collision resistance	Setup
Blake3	~3 GB/s	128-bit	Trustless (no setup)
SHA-256	~1 GB/s	128-bit	Trustless
SHA-1	~2 GB/s	Broken	Trustless

Blake3 was chosen because it is approximately 3× faster than SHA-256 on a single core (per the figures above) while offering the same 128-bit collision resistance. STRATT already uses Blake3, with 14 test vectors and 98 tests, enabling cross-system reuse.

Fingerprint Format

packet-beta
  title Fingerprint Structure
  0-5: "prefix"
  6: ":"
  7-70: "hash"

The stored fingerprint is a single string: the algorithm name, a colon separator, then the hex digest:

blake3:a1b2c3d4... (64 hex chars after colon)
sha256:a1b2c3d4... (fallback)
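A stored value in this format can be split and validated with a small parser. The sketch below is an assumption about how a consumer might do it; parse_fingerprint is a hypothetical helper. It accepts upper-case hex and lower-cases it, in line with the case-insensitive comparison used during verification.

```python
import re

# Six-character algorithm name, a colon, then exactly 64 hex characters.
_FINGERPRINT_RE = re.compile(r"(blake3|sha256):([0-9a-fA-F]{64})")


def parse_fingerprint(value: str):
    """Split 'algo:digest' into (algorithm, lowercase digest) or raise ValueError."""
    match = _FINGERPRINT_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid fingerprint: {value!r}")
    return match.group(1), match.group(2).lower()


algorithm, digest = parse_fingerprint("blake3:" + "AB" * 32)
```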

Verification Flow

sequenceDiagram
    participant Client
    participant Service as iris-service
    participant Engine as FingerprintEngine
    participant SDK as iris-sdk

    Client->>Service: GET /v1/sprites/{id}/fingerprint
    Service->>Engine: verify_fingerprint(sprite, stored_hash)
    Engine->>SDK: compute_hash(sprite)
    SDK->>SDK: Stage 1: Parse
    SDK->>SDK: Stage 2: Normalize (strip id/fingerprint/$schema/created)
    SDK->>SDK: Stage 3: Canonical JSON
    SDK->>SDK: Stage 4: UTF-8 encode
    SDK->>SDK: Stage 5: Blake3 hash
    SDK-->>Engine: computed_hash
    Engine->>Engine: case-insensitive comparison
    Engine-->>Service: verified: true/false
    Service-->>Client: Fingerprint {verified, stored, computed, timestamp}
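The comparison and response steps at the end of this flow might look like the sketch below. The function name build_verification_result, the ISO 8601 timestamp format, and the use of a constant-time comparison are all assumptions made for illustration; the source only states that the comparison is case-insensitive and that the response carries verified, stored, computed and timestamp.

```python
import hmac
from datetime import datetime, timezone


def build_verification_result(stored: str, computed: str) -> dict:
    """Compare two fingerprint strings case-insensitively and report the outcome."""
    verified = hmac.compare_digest(
        stored.lower().encode("utf-8"),
        computed.lower().encode("utf-8"),
    )
    return {
        "verified": verified,
        "stored": stored,
        "computed": computed,
        # Assumed format: UTC timestamp in ISO 8601
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


result = build_verification_result("blake3:" + "AB" * 32, "blake3:" + "ab" * 32)
```

Since the whole string is compared, including the algorithm prefix, a sha256: fingerprint stored by a node running the fallback will not verify against a blake3: fingerprint computed elsewhere.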

STRATT Compatibility

The 5-stage pipeline is identical to STRATT’s canonical hashing, enabling cross-system verification:

# IRIS computes fingerprint
iris_hash = FingerprintEngine.compute_hash(sprite)

# MERIDIAN computes fingerprint from same canonical data
meridian_hash = StrattFingerprintEngine.compute_hash(sprite)

# If both follow DEC-005: iris_hash == meridian_hash

This is the foundation of the iris-meridian-adapter’s VerifyFingerprint RPC.

Key Terms

Canonical serialization → A deterministic, unambiguous text representation of data (same input always produces same bytes)
Blake3 → A cryptographic hash function ~3× faster than SHA-256, trustless, no setup required
5-stage pipeline → The canonical hash process: parse → normalize → canonical JSON → UTF-8 → hash
Fingerprint → A 256-bit (64 hex char) identity hash that uniquely identifies a sprite’s semantic content
STRATT-compatible → Uses the same canonical pipeline as the STRATT protocol, enabling cross-system verification
SHA-256 fallback → Alternative hash algorithm if Blake3 library is unavailable

Q&A

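Q: If I change only whitespace in a YAML file, does the fingerprint change? A: No, as long as the whitespace is formatting rather than part of a string value. Stage 1 parses the text into an object, which discards layout; Stage 2 normalizes that object (not the raw text), and Stage 3 emits compact JSON with no whitespace. Formatting-only changes do not affect the fingerprint.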

Q: What if I add a new field to the sprite? A: The fingerprint will change because the canonical representation now includes the new field. This is intentional — it proves the sprite’s content has semantically changed.
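The two answers above can be checked with a small experiment. This sketch uses JSON input and SHA-256 so that it runs with the standard library alone, and it skips Stage 2 field stripping because none of the excluded fields appear in the samples. The function name fingerprint_of and the sample sprites are hypothetical.

```python
import hashlib
import json


def fingerprint_of(text: str) -> str:
    """Parse, serialize canonically, encode, and hash (SHA-256 stand-in)."""
    obj = json.loads(text)
    canonical = json.dumps(
        obj, separators=(",", ":"), sort_keys=True, ensure_ascii=True
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


compact = '{"name":"scout","version":1}'
spaced = '{\n  "version": 1,\n\n  "name":   "scout"\n}'
extended = '{"name":"scout","version":1,"capabilities":["search"]}'

formatting_ignored = fingerprint_of(compact) == fingerprint_of(spaced)    # True
new_field_detected = fingerprint_of(compact) != fingerprint_of(extended)  # True
```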

Q: Why exclude id and created from the hash? A: These are server-generated, non-deterministic fields. Two logically identical sprites created at different times should have the same fingerprint. Only semantic fields (name, version, capabilities, system_prompt, etc.) are hashed.

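Q: Can two different sprites produce the same fingerprint? A: In theory, yes. For any given pair of distinct sprites, the chance that their 256-bit digests coincide is about 2^-256, and deliberately finding a collision takes on the order of 2^128 hash computations (the 128-bit figure in the table above). In practice, this is cryptographically negligible: less likely than a cosmic ray flipping a bit in your CPU.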
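To put a number on "negligible" for a whole population of sprites rather than a single pair, the birthday approximation can be used: with n items and a b-bit digest, the chance of any collision is roughly n(n-1)/2 divided by 2^b, valid while that value is small. The sketch below is a back-of-the-envelope calculation under that approximation; the function name and the sprite count of one billion are arbitrary choices for illustration.

```python
import math


def log2_collision_probability(n_items: int, digest_bits: int = 256) -> float:
    """Return log2 of the approximate chance that any two of n_items collide."""
    if n_items < 2:
        raise ValueError("need at least two items")
    pairs = n_items * (n_items - 1) // 2  # number of distinct pairs
    return math.log2(pairs) - digest_bits


# One billion sprites: probability is roughly 2 to the power of this value
exponent = log2_collision_probability(10**9)
print(round(exponent, 1))  # -197.2
```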

Q: How does the TypeScript SDK compute fingerprints? A: The TypeScript SDK defines FingerprintData types but delegates fingerprint computation to the server. The Python SDK (iris-sdk-python) contains the canonical implementation used by both the SDK CLI and iris-service.

Examples

Fingerprinting is like a master bookbinder identifying a first edition:

You don’t look at the library checkout stamp (id) or the date it was acquired (created)
You examine the actual content: paper type, typeface, ink composition, page dimensions
You record these in the exact same order every time (canonical)
You run them through a precise chemical analysis (Blake3)
The result is a unique “fingerprint” that proves “this is a genuine 1925 first edition, not a 1970 reprint”
