CRUMB a card from devarno-cloud

Run Outcome Classification

kahn beginner 3 min read

ELI5

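A scorecard at the end of the night. Everything cooked right the first time? Clean. Everything cooked, but two dishes had to be re-cooked? Clean-with-a-flake. Some dishes served while others were ruined or never got started? Partial. Dishes served, dishes ruined, and dishes never started, all in the same night? Stuck. Nothing served at all? Catastrophic.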

Technical Deep Dive

core/orchestrator.py::_derive_outcome(counts, flake_retries) maps (done, failed, blocked, flake_retries) → one of five outcome strings (clean, clean_with_flake, partial, stuck, catastrophic), which is reported on the run_end event.

Decision Tree

flowchart TD
A[counts: done, failed, blocked + flake_retries] --> B{failed==0 and blocked==0?}
B -->|yes| C{flake_retries > 0?}
C -->|no| C1[clean]
C -->|yes| C2[clean_with_flake]
B -->|no| D{done==0?}
D -->|yes| D1[catastrophic]
D -->|no| E{done>0 and failed>0 and blocked>0?}
E -->|yes| E1[stuck]
E -->|no| E2[partial]
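
The decision tree above can be sketched in Python as follows. This is a minimal illustration, not the actual body of _derive_outcome in core/orchestrator.py: the function name derive_outcome, the shape of counts (a mapping with the keys "done", "failed" and "blocked") and the default of 0 for missing keys are all assumptions.

```python
from typing import Mapping


def derive_outcome(counts: Mapping[str, int], flake_retries: int) -> str:
    """Classify a finished run from its node counts (sketch of the decision tree)."""
    # Assumption: counts is a mapping; missing keys are treated as 0.
    done = counts.get("done", 0)
    failed = counts.get("failed", 0)
    blocked = counts.get("blocked", 0)

    # Branch B: nothing failed and nothing was blocked.
    if failed == 0 and blocked == 0:
        # Branch C: retries separate the two successful outcomes.
        return "clean_with_flake" if flake_retries > 0 else "clean"

    # Branch D: something went wrong and no node converged.
    if done == 0:
        return "catastrophic"

    # Branch E: done > 0 is already known here, so only failed and blocked matter.
    if failed > 0 and blocked > 0:
        return "stuck"

    return "partial"
```

Ordering the checks this way means flake_retries is only consulted when the run is otherwise clean, which matches the flowchart: a run with failures is never promoted by its retry count.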

Outcome Table

Outcome           done  failed  blocked  exit  meaning
clean             n     0       0        0     Every node converged on first attempt
clean_with_flake  n     0       0        0     Every node converged, ≥1 needed a retry
partial           >0    ≥0      ≥0       1     Some nodes done, some unfinished (not all three counts non-zero)
stuck             >0    >0      >0       1     All three counts non-zero: a true mess
catastrophic      0     n       n        1     Nothing made it across

Process Exit Code

exit_code = 0 if outcome in ("clean", "clean_with_flake") else 1

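run_end.exit_code follows the same rule. The schema elides 0 and only emits the field when the exit code is non-zero, that is, for partial, stuck and catastrophic. A clean_with_flake run exits 0, so the field is omitted for it as well.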
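
A rough sketch of how the exit-code rule and the elided field could fit together is shown below. Only the outcome strings and the exit_code field name come from this card; the helper names and the "event" and "outcome" keys are hypothetical and are not taken from the real run_end schema.

```python
from typing import Any, Dict

# Outcomes that count as success for the process exit code.
SUCCESS_OUTCOMES = ("clean", "clean_with_flake")


def exit_code_for(outcome: str) -> int:
    """Return 0 for the two successful outcomes, 1 for everything else."""
    return 0 if outcome in SUCCESS_OUTCOMES else 1


def build_run_end(outcome: str) -> Dict[str, Any]:
    """Build a run_end payload, omitting exit_code when it would be 0."""
    # Hypothetical payload shape: the "event" and "outcome" keys are assumptions.
    payload: Dict[str, Any] = {"event": "run_end", "outcome": outcome}
    code = exit_code_for(outcome)
    if code != 0:
        payload["exit_code"] = code
    return payload


def read_exit_code(payload: Dict[str, Any]) -> int:
    """Consumer side: a missing exit_code field means 0."""
    return payload.get("exit_code", 0)
```

A consumer of the event therefore has to treat an absent exit_code as 0 rather than as missing data, which is what read_exit_code illustrates.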

Key Terms

  • flake_retries → Sum across nodes of attempts that succeeded after at least one prior failure.
  • stuck → The “all three non-zero” outcome — both progress and pathology in one run.
  • catastrophic → No nodes converged; usually a misconfigured root.

Q&A

Q: Can partial have failed == 0? A: Yes — if done > 0 and blocked > 0 but no node failed (e.g. an upstream timeout left descendants unable to start), the run is partial, not stuck.

Q: Is clean_with_flake a successful run? A: Yes. Process exits 0; History UI renders it as a green pill with a flake glyph. Flake bookkeeping is a diagnostic signal, not a failure.

Q: Why does stuck require all three of done/failed/blocked > 0? A: It’s the “mixed reality” case — work happened, work broke, work was blocked. Distinct from partial (only two of three) so diagnostics can target it specifically.

Examples

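Five-node run: done=4, failed=1, blocked=0, flake_retries=2. Check B (failed==0 and blocked==0?) is no, because failed=1. Check D (done==0?) is no, because done=4. Check E (done>0 ∧ failed>0 ∧ blocked>0?) is no, because blocked=0, so the outcome falls through to partial. The two flake retries do not change the result, since flake_retries is only consulted when nothing failed and nothing was blocked. Exit code 1.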
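
The same walk-through can be reproduced with a small tracing helper that records each branch check as it is evaluated. The helper trace_outcome and its step labels are hypothetical, written only to mirror the flowchart's node letters.

```python
from typing import List, Tuple


def trace_outcome(
    done: int, failed: int, blocked: int, flake_retries: int
) -> Tuple[str, List[str]]:
    """Return the outcome plus the list of branch checks that were evaluated."""
    steps: List[str] = []

    b = failed == 0 and blocked == 0
    steps.append(f"B failed==0 and blocked==0: {b}")
    if b:
        c = flake_retries > 0
        steps.append(f"C flake_retries>0: {c}")
        return ("clean_with_flake" if c else "clean"), steps

    d = done == 0
    steps.append(f"D done==0: {d}")
    if d:
        return "catastrophic", steps

    e = done > 0 and failed > 0 and blocked > 0
    steps.append(f"E done>0 and failed>0 and blocked>0: {e}")
    return ("stuck" if e else "partial"), steps


# The five-node run from the example above.
outcome, steps = trace_outcome(done=4, failed=1, blocked=0, flake_retries=2)
for step in steps:
    print(step)
print("outcome:", outcome)
```

For this input the trace visits B, D and E in that order, each evaluating to False, and never reaches C, so flake_retries plays no part in the result.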
