CRUMB a card from devarno-cloud

Horizontal Scalability Seams

kahn advanced 6 min read

ELI5

The system itself is a one-room shop, but it has two doors that fit standard shipping containers. Want more throughput? Stack more containers behind the doors — the shop machinery doesn’t change.

Technical Deep Dive

KAHN Scope is a single-process, single-port, localhost-only product (SRS §2/§9/§12). Horizontal scalability is a property of two seams, not the process itself.

Seam 1 — EventSource Trait

backend/kahn/eventsource.py defines a Protocol with two implementations today and a third planned:

flowchart TD
P[EventSource Protocol]
P --> A[ArchivedEventSource: replay finished transitions.jsonl]
P --> B[LiveTailEventSource: tail growing transitions.jsonl]
P --> C["PartitionedStreamEventSource (future): one partition per run_id"]

The third implementation is an estimated ~200 LOC behind the existing Protocol. The aggregator, archive, diagnostics, and server layers do not change. That is the ingestion-side scalability story.
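A minimal sketch of how such a seam can look in Python, to make the "swap behind the Protocol" claim concrete. The method name `events`, the constructor arguments, and the `count_events` consumer are assumptions for illustration; the real signatures in eventsource.py are not shown in this card.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class EventSource(Protocol):
    """Hypothetical shape of the Protocol; the real method names may differ."""

    def events(self) -> Iterator[dict]:
        ...


class ArchivedEventSource:
    """Replays a finished transitions.jsonl from start to end."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def events(self) -> Iterator[dict]:
        with self.path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)


class PartitionedStreamEventSource:
    """Sketch of the future impl: reads the single partition owned by run_id."""

    def __init__(self, archive_root: Path, run_id: str) -> None:
        self.path = archive_root / run_id / "transitions.jsonl"

    def events(self) -> Iterator[dict]:
        # Reuses the archived reader; a real stream-backed impl would differ here.
        yield from ArchivedEventSource(self.path).events()


def count_events(source: EventSource) -> int:
    """Stand-in for a downstream layer: it depends only on the Protocol."""
    return sum(1 for _ in source.events())
```

Because `count_events` is typed against `EventSource` only, adding a new implementation leaves it untouched, which is the property the seam is meant to guarantee.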

Seam 2 — Partition-Keyed Archive

.kahn/archive/runs/<run_id>/{transitions.jsonl, summary.json, graph.json}

A fleet deployment can:

  • Mount /data/.kahn/archive/runs/ on a shared filesystem (NFS, EFS, S3-fuse).
  • Back it with an S3 prefix of identical shape.
  • Shard write traffic across instances by hashing run_id, each KAHN instance owning a subset of the keyspace.
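The sharding option above can be sketched as a stable hash of run_id. The function names and the choice of SHA-256 are assumptions, not something the card specifies. Python's built-in `hash()` is salted per process for strings, so it would not give two instances the same answer.

```python
import hashlib


def shard_for(run_id: str, shard_count: int) -> int:
    """Map a run_id to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def owns(run_id: str, shard_index: int, shard_count: int) -> bool:
    """True if the instance at shard_index is the writer for this run_id."""
    return shard_for(run_id, shard_count) == shard_index
```

Each instance would be started with its own `shard_index` and would skip writes for any run_id it does not own. Note that changing `shard_count` remaps ownership, so a fleet that resizes needs a plan for runs still in flight.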

The read path (GET /api/runs, GET /api/runs/<id>/*) is a pure prefix scan. It parallelises trivially because there are no cross-run joins: diagnostics operate on history_index, which is itself a reduce over summary.json rows.
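As a rough illustration of the read path, the sketch below lists runs by scanning the prefix and builds an index by reducing over summary.json files. The function names and the shape of the index (a dict keyed by run_id) are assumptions; the card does not describe the real history_index structure.

```python
from __future__ import annotations

import json
from pathlib import Path


def list_runs(archive_root: Path) -> list[str]:
    """Prefix scan: any directory under the root that holds summary.json is a run."""
    return sorted(p.parent.name for p in archive_root.glob("*/summary.json"))


def build_history_index(archive_root: Path) -> dict[str, dict]:
    """Reduce over summary.json rows; each run is read independently of the others."""
    index: dict[str, dict] = {}
    for run_id in list_runs(archive_root):
        summary_path = archive_root / run_id / "summary.json"
        index[run_id] = json.loads(summary_path.read_text(encoding="utf-8"))
    return index
```

Since no iteration of the loop reads another run's files, the work can be split across threads, processes, or instances by run_id without changing the result.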

What Stays Single-Process

| Surface | Why |
| --- | --- |
| localhost:8080 WebSocket | SRS §12 single-operator, single-machine |
| In-memory aggregator cache | Per-instance, rebuildable on boot |
| Forecast heuristics | Suppressed below 5 runs of history (I-7) |

What Explicitly Does Not Exist

  • No database (JSONL on disk + in-memory scan is the query layer — I-9).
  • No server-side auth / multi-tenancy in OSS mode (SRS §12 non-goal).
  • No cluster coordinator. Instances are independent readers of a shared partition space.

Cost Of The Seams

| Seam | LOC | Verification |
| --- | --- | --- |
| EventSource trait | ~180 LOC in eventsource.py | Unit tests per impl; Protocol type-checked |
| Partition-keyed archive | ~0 LOC (already the shape) | strace invariant: every write under .kahn/archive/runs/<run_id>/** |
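The path half of the strace invariant could be checked with a small predicate like the one below. This is a sketch under assumptions: the card does not say how write paths are collected from strace output, and `violates_invariant` is a hypothetical helper that takes an already-extracted path relative to the working directory.

```python
from pathlib import PurePosixPath

RUNS_ROOT = PurePosixPath(".kahn/archive/runs")


def violates_invariant(write_path: str) -> bool:
    """True if a write path falls outside .kahn/archive/runs/<run_id>/**."""
    path = PurePosixPath(write_path)
    try:
        rel = path.relative_to(RUNS_ROOT)
    except ValueError:
        return True  # not under the runs root at all
    # Require <run_id>/<something>, and reject ".." so a path cannot escape its partition.
    return len(rel.parts) < 2 or ".." in rel.parts
```

A verification script could feed every observed write path through this check and fail if any returns True.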

Key Terms

  • Seam → A boundary deliberately designed so a future implementation can swap behind it without rippling through the codebase.
  • I-8 → “Aggregator/diagnostics/server layers don’t change between modes.” Exercised, not aspirational.
  • Cross-run join → Explicitly absent — diagnostics never join two runs at the row level, only at the summary.json reduce level.

Q&A

Q: How would I shard live writers across two KAHN instances? A: Hash run_id to a shard; each instance owns a subset and writes only into .kahn/archive/runs/<id>/ for IDs it owns. Readers don’t care — both shards live under the same prefix.

Q: Why no cluster coordinator? A: There’s nothing to coordinate. Producers append to disjoint files (one per run_id), and readers scan a prefix. Coordination would only be needed if writes raced on the same file, which the partition design forbids.

Q: What’s the marginal LOC cost vs. “just use Postgres + k8s”? A: Under 200 LOC and one documented Protocol. The alternative deletes the single-operator OSS product the SRS ships.

Examples

Onboarding a third producer (e.g. traceo-cat): mount the producer’s working dir at /data/.kahn/archive/runs/, point its emitter at the partitioned path, and restart the same KAHN Scope binary. Zero code change inside KAHN.
