CRUMB a card from devarno-cloud

Ralph Convergence Loop

kahn intermediate 5 min read

ELI5

Re-running a recipe up to six times, sleeping a little longer each time, and only declaring it cooked when every taste-tester gives a thumbs-up. If the first try failed but the second worked, that counts as a “flake” — written down separately from a clean win.

Technical Deep Dive

The inner loop in core/orchestrator.py::ralph() re-invokes the prompt until done_when converges or max_ralph_iters (default 6) is reached.

Per-Attempt Sequence

sequenceDiagram
autonumber
participant S as Scheduler
participant R as ralph()
participant C as claude CLI
participant DW as done_when[]
participant E as Emitter
S->>R: invoke(node)
loop attempt i in 1..max_ralph_iters
alt retry attempt
R->>R: sleep min(2^(i-1), 60)s
end
R->>C: claude --print --prompt-file
C-->>R: rc, stdout, stderr
R->>DW: run each cmd, collect rc+duration
DW-->>R: ok / not ok
R->>E: node_attempt(i, dur, ok, results, backoff_s)
alt ok
R-->>S: True (converged)
end
end
R-->>S: False (max_ralph_iters_reached)

Convergence Rule

check_convergence runs every done_when shell command. All must exit 0 for the attempt to converge. Any non-zero rc captures tail (stderr+stdout, last 4096 bytes), with truncated=true set when the cap is hit.

Backoff

Backoff before attempt i (i ≥ 2) is min(2^(i-1), 60) seconds. Attempt 1 has no backoff_s.

Flakiness Bookkeeping

flowchart LR
A["attempt 1 fails"] --> B["attempt 2 succeeds"]
B --> C["flake_retries += 1"]
D["attempt 1 succeeds"] --> E["flake_retries += 0"]

flake_retries increments only when a later attempt converges after an earlier failure. The counter feeds run_end.flake_retries and gates the clean_with_flake outcome.

Key Terms

  • done_when → Ordered list of shell commands; all must exit 0 for convergence.
  • max_ralph_iters → Per-node attempt cap (graph default 6); reaching it ⇒ failed with reason="max_ralph_iters_reached".
  • backoff_s → Pre-attempt sleep, exponential and capped at 60s.

Q&A

Q: Does the orchestrator care about claude’s exit code? A: No. rc from the CLI is logged but ignored — convergence is decided by done_when, not by claude. This is the “convergence, not exit zero” tenet.

Q: How big can a failure tail get? A: Capped at 4096 bytes (last 4 KiB). When trimmed, the event sets truncated: true.

Q: Where does the per-attempt log file live? A: <state_dir>/<node.id>.log, overwritten each attempt with rc, stdout, stderr, and the convergence verdict.

Examples

A flaky test node with done_when=["pnpm test src/db/auth.test.ts"] fails on attempt 1 (network blip), waits 2s, retries and succeeds. Result: converged=true on attempt 2, flake_retries+=1, run-end outcome promoted from clean to clean_with_flake.

neighbors on the map