Ralph Convergence Loop
kahn intermediate 5 min read
ELI5
Re-running a recipe up to six times, sleeping a little longer each time, and only declaring it cooked when every taste-tester gives a thumbs-up. If the first try failed but the second worked, that counts as a “flake” — written down separately from a clean win.
Technical Deep Dive
The inner loop in core/orchestrator.py::ralph() re-invokes the prompt until done_when converges or max_ralph_iters (default 6) is reached.
Per-Attempt Sequence
sequenceDiagram autonumber participant S as Scheduler participant R as ralph() participant C as claude CLI participant DW as done_when[] participant E as Emitter
S->>R: invoke(node) loop attempt i in 1..max_ralph_iters alt retry attempt R->>R: sleep min(2^(i-1), 60)s end R->>C: claude --print --prompt-file C-->>R: rc, stdout, stderr R->>DW: run each cmd, collect rc+duration DW-->>R: ok / not ok R->>E: node_attempt(i, dur, ok, results, backoff_s) alt ok R-->>S: True (converged) end end R-->>S: False (max_ralph_iters_reached)Convergence Rule
check_convergence runs every done_when shell command. All must exit 0 for the attempt to converge. Any non-zero rc captures tail (stderr+stdout, last 4096 bytes), with truncated=true set when the cap is hit.
Backoff
Backoff before attempt i (i ≥ 2) is min(2^(i-1), 60) seconds. Attempt 1 has no backoff_s.
Flakiness Bookkeeping
flowchart LR A["attempt 1 fails"] --> B["attempt 2 succeeds"] B --> C["flake_retries += 1"] D["attempt 1 succeeds"] --> E["flake_retries += 0"]flake_retries increments only when a later attempt converges after an earlier failure. The counter feeds run_end.flake_retries and gates the clean_with_flake outcome.
Key Terms
- done_when → Ordered list of shell commands; all must
exit 0for convergence. - max_ralph_iters → Per-node attempt cap (graph default 6); reaching it ⇒
failedwithreason="max_ralph_iters_reached". - backoff_s → Pre-attempt sleep, exponential and capped at 60s.
Q&A
Q: Does the orchestrator care about claude’s exit code?
A: No. rc from the CLI is logged but ignored — convergence is decided by done_when, not by claude. This is the “convergence, not exit zero” tenet.
Q: How big can a failure tail get?
A: Capped at 4096 bytes (last 4 KiB). When trimmed, the event sets truncated: true.
Q: Where does the per-attempt log file live?
A: <state_dir>/<node.id>.log, overwritten each attempt with rc, stdout, stderr, and the convergence verdict.
Examples
A flaky test node with done_when=["pnpm test src/db/auth.test.ts"] fails on attempt 1 (network blip), waits 2s, retries and succeeds. Result: converged=true on attempt 2, flake_retries+=1, run-end outcome promoted from clean to clean_with_flake.
neighbors on the map
- Run Outcome Classification interpreting a History row's status pill
- End-to-End Chain Execution Request Flow tracing a chain execution through the entire system
- CAIRNET Reaction System understanding how agents signal approval/disagreement