Horizontal Scalability Seams
ELI5
The system itself is a one-room shop, but it has two doors that fit standard shipping containers. Want more throughput? Stack more containers behind the doors — the shop machinery doesn’t change.
Technical Deep Dive
KAHN Scope is a single-process, single-port, localhost-only product (SRS §2/§9/§12). Horizontal scalability is a property of two seams, not the process itself.
Seam 1 — EventSource Trait
backend/kahn/eventsource.py defines a Protocol with three implementations, two in use today and one planned:
flowchart TD
  P[EventSource Protocol]
  P --> A[ArchivedEventSource: replay finished transitions.jsonl]
  P --> B[LiveTailEventSource: tail growing transitions.jsonl]
  P --> C["PartitionedStreamEventSource (future): one partition per run_id"]

A third impl is ~200 LOC behind the existing Protocol. Aggregator, archive, diagnostics, and server layers do not change. That is the ingestion-side scalability story.
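To make the seam concrete, here is a minimal sketch of what a Protocol-based event source could look like. The method name `events()`, the dict-per-line event shape, and the `count_events` consumer are assumptions for illustration; the actual signatures in eventsource.py are not shown in this document.

```python
import json
from pathlib import Path
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class EventSource(Protocol):
    """Hypothetical shape of the seam: anything that yields transition dicts."""

    def events(self) -> Iterator[dict]:
        ...


class ArchivedEventSource:
    """Replays a finished transitions.jsonl file from start to end."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def events(self) -> Iterator[dict]:
        with self.path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines between records
                    yield json.loads(line)


def count_events(source: EventSource) -> int:
    """Stand-in for a downstream layer: it depends only on the Protocol."""
    return sum(1 for _ in source.events())
```

Because `count_events` only sees the Protocol, a new source with the same method would plug in without touching it, which is the property the seam is meant to provide.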
Seam 2 — Partition-Keyed Archive
.kahn/archive/runs/<run_id>/{transitions.jsonl, summary.json, graph.json}

A fleet deployment can:
- Mount /data/.kahn/archive/runs/ on a shared filesystem (NFS, EFS, S3-fuse).
- Back it with an S3 prefix of identical shape.
- Shard write traffic across instances by hashing run_id, each KAHN instance owning a subset of the keyspace.
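Sharding by run_id can be sketched as follows. The function names and the choice of SHA-256 with a modulo are assumptions, not something KAHN ships. The one firm constraint is that the hash must be stable across processes, which rules out Python's built-in `hash()` for strings because it is salted per process by default.

```python
import hashlib


def shard_for(run_id: str, num_shards: int) -> int:
    """Map a run_id to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # The first 8 bytes are plenty for an even spread over a small shard count.
    return int.from_bytes(digest[:8], "big") % num_shards


def owns(run_id: str, instance_index: int, num_shards: int) -> bool:
    """True if the instance at instance_index should write this run's partition."""
    return shard_for(run_id, num_shards) == instance_index
```

With plain modulo hashing, changing `num_shards` reassigns most run IDs, so a deployment that expects to resize often might prefer a consistent-hashing scheme instead.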
The read path (GET /api/runs, GET /api/runs/<id>/*) is a pure prefix scan. It parallelises trivially: there are no cross-run joins, and diagnostics operate on history_index, which is itself a reduce over summary.json rows.
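A rough sketch of the read path as a prefix scan plus a reduce is shown below. The function names, the treatment of each summary.json as an opaque dict, and the dict-keyed-by-run_id shape of the index are assumptions; the real history_index structure is not described here.

```python
import json
from pathlib import Path


def list_runs(archive_root: Path) -> list[str]:
    """Prefix scan: every directory under runs/ that has a summary.json is a run."""
    runs_dir = archive_root / "runs"
    if not runs_dir.is_dir():
        return []
    return sorted(p.name for p in runs_dir.iterdir() if (p / "summary.json").is_file())


def build_history_index(archive_root: Path) -> dict[str, dict]:
    """Reduce over summary.json rows, one per run; no run reads another run's files."""
    index = {}
    for run_id in list_runs(archive_root):
        summary_path = archive_root / "runs" / run_id / "summary.json"
        index[run_id] = json.loads(summary_path.read_text(encoding="utf-8"))
    return index
```

Each loop iteration touches a single run directory, so the work could be split across workers or instances by run_id without any shared state.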
What Stays Single-Process
| Surface | Why |
|---|---|
| localhost:8080 WebSocket | SRS §12 single-operator, single-machine |
| In-memory aggregator cache | Per-instance, rebuildable on boot |
| Forecast heuristics | Suppressed below 5 runs of history (I-7) |
What Explicitly Does Not Exist
- No database (JSONL on disk + in-memory scan is the query layer — I-9).
- No server-side auth / multi-tenancy in OSS mode (SRS §12 non-goal).
- No cluster coordinator. Instances are independent readers of a shared partition space.
Cost Of The Seams
| Seam | LOC | Verification |
|---|---|---|
| EventSource trait | ~180 LOC in eventsource.py | Unit tests per impl; Protocol type-checked |
| Partition-keyed archive | ~0 LOC — already the shape | strace invariant: every write under .kahn/archive/runs/<run_id>/** |
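The write-path invariant in the table can be expressed as a small path check. This sketch assumes the written paths have already been collected (for example from a trace) as a list of strings. It does not parse strace output, and the function names are hypothetical.

```python
from pathlib import PurePosixPath


def is_under_run_partition(path: str, archive_runs: str = ".kahn/archive/runs") -> bool:
    """True if path lies inside <archive_runs>/<run_id>/ with at least one component below it."""
    root = PurePosixPath(archive_runs)
    candidate = PurePosixPath(path)
    if ".." in candidate.parts:
        return False  # refuse paths that could climb out of the partition
    try:
        relative = candidate.relative_to(root)
    except ValueError:
        return False  # not under the archive prefix at all
    # Require <run_id>/<file>: a write directly into runs/ breaks the invariant.
    return len(relative.parts) >= 2


def violations(written_paths: list[str]) -> list[str]:
    """Return every written path that falls outside a run partition."""
    return [p for p in written_paths if not is_under_run_partition(p)]
```

An empty result from `violations` is what the invariant asks for. Any returned path points at a write that would need either a fix or a documented exception.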
Key Terms
- Seam → A boundary deliberately designed so a future implementation can swap behind it without rippling through the codebase.
- I-8 → “Aggregator/diagnostics/server layers don’t change between modes.” Exercised, not aspirational.
- Cross-run join → Explicitly absent: diagnostics never join two runs at the row level, only at the summary.json reduce level.
Q&A
Q: How would I shard live writers across two KAHN instances?
A: Hash run_id to a shard; each instance owns a subset and writes only into .kahn/archive/runs/<id>/ for IDs it owns. Readers don’t care — both shards live under the same prefix.
Q: Why no cluster coordinator?
A: There’s nothing to coordinate. Producers append to disjoint files (one per run_id), and readers scan a prefix. Coordination would only be needed if writes raced on the same file, which the partition design forbids.
Q: What’s the marginal LOC cost vs. “just use Postgres + k8s”?
A: Under 200 LOC and one documented Protocol. The alternative deletes the single-operator OSS product the SRS ships.
Examples
Onboarding a third producer (e.g. traceo-cat): mount the producer’s working dir to /data/.kahn/archive/runs/, point its emitter at the partitioned path, restart the same KAHN Scope binary. Zero code change inside KAHN.