Horizontal Scalability Seams

ELI5

The system itself is a one-room shop, but it has two doors that fit standard shipping containers. Want more throughput? Stack more containers behind the doors — the shop machinery doesn’t change.

Technical Deep Dive

KAHN Scope is a single-process, single-port, localhost-only product (SRS §2/§9/§12). Horizontal scalability is a property of two seams, not of the process itself.

Seam 1 — `EventSource` Trait

backend/kahn/eventsource.py defines a Protocol with two implementations today and a third planned:

flowchart TD
    P[EventSource Protocol]
    P --> A[ArchivedEventSource: replay finished transitions.jsonl]
    P --> B[LiveTailEventSource: tail growing transitions.jsonl]
    P --> C["PartitionedStreamEventSource (future): one partition per run_id"]

The third implementation is roughly 200 LOC behind the existing Protocol; the aggregator, archive, diagnostics, and server layers do not change. That is the ingestion-side scalability story.
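To make the seam concrete, here is a minimal sketch of how a Protocol-based source could look. The method name `events()`, the constructor arguments, and the `count_events` consumer are assumptions for illustration; the real signatures in eventsource.py may differ, and the partitioned class is only a guess at the future implementation's shape.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class EventSource(Protocol):
    """Hypothetical shape of the Protocol; the real one may differ."""

    def events(self) -> Iterator[dict]:
        ...


class ArchivedEventSource:
    """Replays a finished transitions.jsonl from start to end."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def events(self) -> Iterator[dict]:
        with self.path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():  # skip blank lines
                    yield json.loads(line)


class PartitionedStreamEventSource:
    """Sketch of the future impl: one partition (directory) per run_id."""

    def __init__(self, runs_root: Path, run_ids: list[str]) -> None:
        self.runs_root = runs_root
        self.run_ids = run_ids

    def events(self) -> Iterator[dict]:
        for run_id in self.run_ids:
            partition = self.runs_root / run_id / "transitions.jsonl"
            yield from ArchivedEventSource(partition).events()


def count_events(source: EventSource) -> int:
    # Stand-in for a downstream layer: it depends only on the Protocol,
    # so swapping the concrete source requires no change here.
    return sum(1 for _ in source.events())
```

Because `count_events` is typed against the Protocol rather than a concrete class, a type checker accepts any of the sources without the consumer being edited.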

Seam 2 — Partition-Keyed Archive

.kahn/archive/runs/<run_id>/{transitions.jsonl, summary.json, graph.json}

A fleet deployment can:

Mount /data/.kahn/archive/runs/ on a shared filesystem (NFS, EFS, S3-fuse).
Back it with an S3 prefix of identical shape.
Shard write traffic across instances by hashing run_id, each KAHN instance owning a subset of the keyspace.
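The hash-sharding option can be sketched as follows. The function names and the choice of SHA-256 are assumptions, not something the source specifies; the one firm constraint is that the hash must be stable across processes, which rules out Python's built-in `hash()` for strings because it is salted per process by default.

```python
import hashlib


def shard_for(run_id: str, num_shards: int) -> int:
    """Map a run_id to a shard index using a process-independent hash."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # The first 8 bytes are plenty for an even spread over a few shards.
    return int.from_bytes(digest[:8], "big") % num_shards


def owns(run_id: str, shard_index: int, num_shards: int) -> bool:
    """True if the instance with this shard_index should write this run."""
    return shard_for(run_id, num_shards) == shard_index
```

Each instance would call `owns(run_id, my_index, total)` before writing, so every run_id has exactly one writer. Note that changing `num_shards` remaps most keys under plain modulo hashing.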

The read path (GET /api/runs, GET /api/runs/<id>/*) is a pure prefix scan. It parallelises trivially: there are no cross-run joins, and diagnostics operate on history_index, which is itself a reduce over summary.json rows.
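A rough sketch of that read path, assuming a local or mounted filesystem: list the run directories under the prefix, then fold each summary.json into an index. The function names and the dict-shaped index are illustrative assumptions; the actual history_index structure is not described here.

```python
from __future__ import annotations

import json
from pathlib import Path


def list_runs(runs_root: Path) -> list[str]:
    """Prefix scan: a run is any directory that holds a summary.json."""
    return sorted(p.parent.name for p in runs_root.glob("*/summary.json"))


def build_history_index(runs_root: Path) -> dict[str, dict]:
    """Reduce over summary.json rows; no run ever reads another run's files."""
    index: dict[str, dict] = {}
    for run_id in list_runs(runs_root):
        summary_path = runs_root / run_id / "summary.json"
        with summary_path.open(encoding="utf-8") as fh:
            index[run_id] = json.load(fh)
    return index
```

Since each iteration touches only one run's directory, the loop body could be handed to a worker pool or split across instances without any shared state.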

What Stays Single-Process

Surface	Why
`localhost:8080` WebSocket	SRS §12 single-operator, single-machine
In-memory aggregator cache	Per-instance, rebuildable on boot
Forecast heuristics	Suppressed below 5 runs of history (I-7)

What Explicitly Does Not Exist

No database (JSONL on disk + in-memory scan is the query layer — I-9).
No server-side auth / multi-tenancy in OSS mode (SRS §12 non-goal).
No cluster coordinator. Instances are independent readers of a shared partition space.

Cost Of The Seams

Seam	LOC	Verification
`EventSource` trait	~180 LOC in `eventsource.py`	Unit tests per impl; `Protocol` type-checked
Partition-keyed archive	~0 LOC — already the shape	`strace` invariant: every write under `.kahn/archive/runs/<run_id>/**`
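The write invariant in the table can be expressed as a small path check. This is a sketch only: it assumes the written paths have already been collected (for example from a trace) as strings relative to the working directory, and the function name is hypothetical. It does not parse `strace` output.

```python
import posixpath
from pathlib import PurePosixPath

ARCHIVE_ROOT = PurePosixPath(".kahn/archive/runs")


def violates_invariant(written_path: str, root: PurePosixPath = ARCHIVE_ROOT) -> bool:
    """True if a write lands outside <root>/<run_id>/**."""
    # Collapse "." and ".." lexically so traversal cannot escape the prefix.
    normalised = PurePosixPath(posixpath.normpath(written_path))
    try:
        relative = normalised.relative_to(root)
    except ValueError:
        return True  # not under the archive root at all
    # Require at least <run_id>/<file>; a file directly under runs/ is invalid.
    return len(relative.parts) < 2
```

Running this over every observed write path and asserting that none violates the invariant gives a cheap regression check for the partition shape.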

Key Terms

Seam → A boundary deliberately designed so a future implementation can swap behind it without rippling through the codebase.
I-8 → “Aggregator/diagnostics/server layers don’t change between modes.” Exercised, not aspirational.
Cross-run join → Explicitly absent — diagnostics never join two runs at the row level, only at the summary.json reduce level.

Q&A

Q: How would I shard live writers across two KAHN instances?
A: Hash run_id to a shard; each instance owns a subset and writes only into .kahn/archive/runs/<id>/ for IDs it owns. Readers don’t care: both shards live under the same prefix.

Q: Why no cluster coordinator?
A: There’s nothing to coordinate. Producers append to disjoint files (one per run_id), and readers scan a prefix. Coordination would only be needed if writes raced on the same file, which the partition design forbids.

Q: What’s the marginal LOC cost vs. “just use Postgres + k8s”?
A: Under 200 LOC and one documented Protocol. The alternative deletes the single-operator OSS product the SRS ships.

Examples

Onboarding a third producer (e.g. traceo-cat): mount the producer’s working dir to /data/.kahn/archive/runs/, point its emitter at the partitioned path, restart the same KAHN Scope binary. Zero code change inside KAHN.
