CRUMB a card from devarno-cloud

RALPH on KAHN Transitions Schema

rocky intermediate 5 min read

ELI5

KAHN already defines a standard ledger format for “node ran, here’s what happened” events. Rocky chose not to invent a second format — RALPH writes its events in KAHN’s shape, smuggling the few RALPH-specific fields through KAHN’s documented “unknown fields pass through” door.

Technical Deep Dive

Why Rocky adopted KAHN’s schema

KAHN ships a vendored producer→consumer contract at kahn-hq/contracts/: transitions.schema.json, graph.schema.json, and the stdlib-only kahn_emit.py. Sister repos vendor it directly. Per system-redesign §KAHN integration, Phase 3 made RALPH a consumer, not a fork:

  • Each Rocky run = one KAHN run.
  • Each prompt = one node.
  • plan / execute / audit = node attempts.
  • Deviation severity → Outcome enum (clean | clean_with_flake | partial | stuck | catastrophic).
  • RALPH-specific deviation metadata rides as KAHN’s documented “unknown fields pass through” extension.
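The mapping above can be sketched as a small stdlib-only journal builder. This is an illustration only, not the vendored kahn_emit.py, whose API this card does not show. The `type` discriminator values (`run_start`, `node_attempt`, `node_transition`, `run_end`), the helper name `journal_for_run`, the numbering of plan / execute / audit as attempts 1 to 3, and the `ralph_deviation` key are all assumptions made for the example. Field names follow the class diagram in this card.

```python
import json
from datetime import datetime, timezone

OUTCOMES = ("clean", "clean_with_flake", "partial", "stuck", "catastrophic")
PHASES = ("plan", "execute", "audit")


def now_iso():
    # ISO-8601 UTC timestamp; the real `ts` format is an assumption.
    return datetime.now(timezone.utc).isoformat()


def journal_for_run(run_id, actor, nodes, run_outcome):
    """Build the event list for one Rocky run (= one KAHN run).

    nodes: list of (node_id, outcome, extra) tuples, one per prompt.
    extra: dict of RALPH-specific keys carried as unknown fields.
    """
    if run_outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {run_outcome!r}")
    events = [{"type": "run_start", "run_id": run_id, "ts": now_iso(), "actor": actor}]
    for node_id, outcome, extra in nodes:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        # Assumption: each phase is recorded as one numbered attempt.
        for attempt, _phase in enumerate(PHASES, start=1):
            events.append({
                "type": "node_attempt",
                "run_id": run_id,
                "node_id": node_id,
                "attempt": attempt,
                "ts": now_iso(),
            })
        # Extra keys go first so the standard KAHN keys always win on a clash.
        events.append({
            **extra,
            "type": "node_transition",
            "run_id": run_id,
            "node_id": node_id,
            "outcome": outcome,
        })
    events.append({"type": "run_end", "run_id": run_id, "outcome": run_outcome, "ts": now_iso()})
    return events


events = journal_for_run(
    "run-1",
    "rocky",
    [("prompt-1", "clean_with_flake", {"ralph_deviation": {"severity": "minor"}})],
    "clean_with_flake",
)
lines = [json.dumps(event) for event in events]  # one JSON line per event
```

One prompt yields six events here: a run start, three attempts, one transition carrying the extra key, and a run end.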

What lives where

flowchart LR
subgraph KAHN[kahn-hq/contracts/]
S[transitions.schema.json]
G[graph.schema.json]
E[kahn_emit.py]
end
subgraph RALPH[rocky-hq/ralph/]
V[vendored kahn_emit.py]
P[pydantic models<br/>per Phase 4 D6]
end
subgraph Contracts[rocky-hq/contracts/]
Z[zod KahnEventSchema]
FX[KAHN parity test]
end
subgraph Console[rocky-hq/console/]
Re["@rocky/contracts/ralph"]
Pa[parseKahnEvent at SSE boundary]
end
S --> V
E --> V
S -->|diff| FX
Z --> Re
Z --> Pa
P -->|JSON Schema| FX

Schemas in scope (Phase 4 §5)

Schema                        | Wire boundary
------------------------------|--------------------------------
NodeAttemptSchema             | KAHN journal frame
NodeTransitionSchema          | KAHN journal frame
RunStartSchema / RunEndSchema | KAHN journal envelope
KahnEventSchema               | discriminated union of the four
OutcomeSchema                 | KILN-extensible enum

KILN-extensibility: convergence_score? and early_stop_reason? are .optional() in zod and surface as | undefined in TS. Consumers without KILN render no-op slots; consumers with KILN are forward-compatible without a schema bump.
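A minimal Python sketch of the consumer side of that rule: read the two optional fields with a default of None and render a no-op slot when both are absent. The function name `render_kiln_slot` and the output strings are hypothetical; only the two field names come from the schema.

```python
def render_kiln_slot(transition: dict) -> str:
    """Return display text for the KILN slot of a NodeTransition dict."""
    score = transition.get("convergence_score")    # absent -> None
    reason = transition.get("early_stop_reason")   # absent -> None
    if score is None and reason is None:
        return ""  # no-op slot: the producer emits no KILN metadata
    parts = []
    if score is not None:
        parts.append(f"convergence={score:.2f}")
    if reason is not None:
        parts.append(f"early stop: {reason}")
    return ", ".join(parts)


without_kiln = {"run_id": "r1", "node_id": "n1", "outcome": "clean"}
with_kiln = {**without_kiln, "convergence_score": 0.93, "early_stop_reason": "plateau"}
```

The same function handles both frames, which is the sense in which a consumer stays forward-compatible without a schema bump.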

Outcome fold (Phase 3e D7)

Pass-rate is binary; the five-valued enum needs an explicit fold:

Outcome          | Folds to
-----------------|---------------------
clean            | pass
clean_with_flake | pass (a known retry)
partial          | fail
stuck            | fail
catastrophic     | fail

The fold lives in console/src/lib/workspace/ralph-runs.ts:aggregateRuns24h and feeds the dash/ralph-runs panel.
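As an illustration of the fold, here is a Python sketch of the table and a pass-rate calculation built on it. The actual implementation is the TypeScript function named above, which this card does not show. The names `FOLD`, `fold`, and `pass_rate` are hypothetical, and returning None for an empty window is an assumption.

```python
FOLD = {
    "clean": "pass",
    "clean_with_flake": "pass",  # a known retry still counts as a pass
    "partial": "fail",
    "stuck": "fail",
    "catastrophic": "fail",
}


def fold(outcome: str) -> str:
    """Collapse the five-valued Outcome enum to pass/fail."""
    try:
        return FOLD[outcome]
    except KeyError:
        # Fail loudly so a new enum member cannot be folded silently.
        raise ValueError(f"unknown outcome: {outcome!r}") from None


def pass_rate(outcomes):
    """Fraction of outcomes that fold to pass, or None for an empty input."""
    outcomes = list(outcomes)
    if not outcomes:
        return None
    passes = sum(1 for outcome in outcomes if fold(outcome) == "pass")
    return passes / len(outcomes)


rate = pass_rate(["clean", "clean_with_flake", "partial", "stuck"])
```

Raising on an unknown value is a deliberate choice in this sketch: if the enum grows, the fold has to be extended explicitly rather than defaulting to pass or fail.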

Class diagram

classDiagram
class KahnEvent {
<<discriminated union>>
type
}
class RunStart {
+string run_id
+string ts
+string actor
}
class NodeAttempt {
+string run_id
+string node_id
+int attempt
+string ts
}
class NodeTransition {
+string run_id
+string node_id
+Outcome outcome
+number? convergence_score
+string? early_stop_reason
}
class RunEnd {
+string run_id
+Outcome outcome
+string ts
}
class Outcome {
<<enum>>
clean
clean_with_flake
partial
stuck
catastrophic
}
KahnEvent <|-- RunStart
KahnEvent <|-- NodeAttempt
KahnEvent <|-- NodeTransition
KahnEvent <|-- RunEnd
NodeTransition --> Outcome

Two-source-of-truth pattern (Phase 4 D6)

Zod is the source of truth for TypeScript; pydantic is the source of truth for Python. Contracts CI runs a parity test that diffs the JSON Schema emitted from each. There is no codegen step in the consumer: both sides own their own models, and the contracts package proves they agree.

For KAHN, a separate parity test diffs the emitted JSON Schema against upstream kahn-hq/contracts/transitions.schema.json so RALPH cannot silently drift.
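The idea of a parity check can be sketched in stdlib Python as a recursive diff over two already-emitted JSON Schema dicts. This is not the project's CI test: `schema_diff` is a hypothetical name, the two sample schemas are invented, and sorting string lists (so `required` and `enum` ordering does not count as drift) is an assumption about what a useful normalization would be.

```python
def _is_str_list(value):
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def schema_diff(left, right, path="$"):
    """Return a list of paths where two JSON-like values differ."""
    if isinstance(left, dict) and isinstance(right, dict):
        diffs = []
        for key in sorted(set(left) | set(right)):
            child = f"{path}.{key}"
            if key not in left:
                diffs.append(f"{child}: missing on left")
            elif key not in right:
                diffs.append(f"{child}: missing on right")
            else:
                diffs.extend(schema_diff(left[key], right[key], child))
        return diffs
    if _is_str_list(left) and _is_str_list(right):
        # Treat lists of strings as order-insensitive.
        left, right = sorted(left), sorted(right)
    return [] if left == right else [f"{path}: {left!r} != {right!r}"]


# Invented sample schemas standing in for the zod and pydantic emissions.
from_zod = {
    "type": "object",
    "properties": {"outcome": {"enum": ["clean", "stuck"]}},
    "required": ["run_id", "outcome"],
}
from_pydantic = {
    "type": "object",
    "properties": {"outcome": {"enum": ["stuck", "clean"]}},
    "required": ["outcome", "run_id"],
}
drifted = {
    "type": "object",
    "properties": {"outcome": {"enum": ["clean"]}},
    "required": ["outcome", "run_id"],
}

in_parity = schema_diff(from_zod, from_pydantic)
drift = schema_diff(from_zod, drifted)
```

A CI job built on this shape would fail whenever the returned list is non-empty and print the offending paths.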

Key Terms

  • kahn_emit.py → stdlib-only producer helper from kahn-hq, vendored into ralph/ (not re-vendored in contracts/)
  • KILN slot → optional outcome metadata (convergence_score, early_stop_reason) that downstream consumers may surface
  • Passthrough → zod .passthrough() on nested objects so unknown fields ride through without a schema bump
  • Parity test → CI check that emits JSON Schema from both source-of-truth representations and diffs them

Q&A

Q: Why doesn’t Rocky just re-vendor transitions.schema.json in @rocky/contracts?
A: Phase 4 D4: KAHN stays vendored upstream. Re-vendoring would create two sources of truth and a sync-or-die maintenance load. The parity test catches drift without forking.

Q: How does a RALPH-specific field ride a KAHN frame?
A: As an unknown key on a passthrough-enabled object. KAHN consumers ignore it; RALPH consumers recognise it. No schema bump required.

Q: What does clean_with_flake mean for pass rate?
A: It folds to pass. It is a successful run that needed a retry; the retry is interesting for flake telemetry, but the run completed cleanly.
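The passthrough answer can be illustrated with a small stdlib-only Python sketch: known keys are validated, and unknown keys are kept rather than rejected. It mimics the behaviour this card attributes to zod .passthrough() and is not the project's parser. `parse_transition`, the `extras` attribute, and the `ralph_deviation` key are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

OUTCOMES = {"clean", "clean_with_flake", "partial", "stuck", "catastrophic"}
REQUIRED = ("run_id", "node_id", "outcome")
OPTIONAL = ("convergence_score", "early_stop_reason")


@dataclass
class NodeTransition:
    run_id: str
    node_id: str
    outcome: str
    convergence_score: Optional[float] = None
    early_stop_reason: Optional[str] = None
    extras: dict = field(default_factory=dict)  # unknown keys ride through here


def parse_transition(raw: dict) -> NodeTransition:
    """Validate the standard keys and preserve everything else."""
    missing = [key for key in REQUIRED if key not in raw]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    if raw["outcome"] not in OUTCOMES:
        raise ValueError(f"unknown outcome: {raw['outcome']!r}")
    known = {key: raw[key] for key in REQUIRED + OPTIONAL if key in raw}
    # Drop the discriminator; keep every other unrecognised key untouched.
    extras = {k: v for k, v in raw.items() if k not in known and k != "type"}
    return NodeTransition(**known, extras=extras)


frame = {
    "type": "node_transition",
    "run_id": "r1",
    "node_id": "n1",
    "outcome": "partial",
    "ralph_deviation": {"severity": "minor"},  # RALPH-specific, unknown to KAHN
}
parsed = parse_transition(frame)
```

A KAHN-only consumer would read `parsed.outcome` and never look at `parsed.extras`; a RALPH consumer can pick `ralph_deviation` out of it.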

Examples

Picture two airlines using the same standard luggage-tag format (KAHN). Each carrier prints its airline-specific data in the optional comments field. A baggage handler reading a tag understands the standard fields from any carrier and can still see, but safely ignore, the airline-specific notes when routing across hubs.
