Run Outcome Classification
kahn beginner 3 min read
ELI5
Think of a scorecard at the end of the night. Everything cooked cleanly? Perfect (clean). Everything cooked, but two dishes had to be re-cooked? Clean-with-a-flake. Some dishes done and others not? Partial. Some done, some ruined, and some never started? Stuck. No dish made it out? Catastrophic.
Technical Deep Dive
core/orchestrator.py::_derive_outcome(counts, flake_retries) maps (done, failed, blocked, flake_retries) → one of five outcome strings on the run_end event.
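The mapping can be sketched in a few lines of Python. This is a minimal illustration, not the real `_derive_outcome`: the function name `derive_outcome` and the assumption that `counts` is a mapping with `done`, `failed`, and `blocked` keys are hypothetical.

```python
from typing import Mapping


def derive_outcome(counts: Mapping[str, int], flake_retries: int) -> str:
    """Illustrative sketch: classify a run from its node counts.

    Assumes `counts` carries "done", "failed", and "blocked" keys
    (hypothetical shape; missing keys are treated as 0).
    """
    done = counts.get("done", 0)
    failed = counts.get("failed", 0)
    blocked = counts.get("blocked", 0)

    # Nothing failed and nothing was blocked: a clean run, possibly with flakes.
    if failed == 0 and blocked == 0:
        return "clean_with_flake" if flake_retries > 0 else "clean"
    # Something went wrong and no node converged at all.
    if done == 0:
        return "catastrophic"
    # done > 0 here, so all three counts are non-zero.
    if failed > 0 and blocked > 0:
        return "stuck"
    # done > 0 and exactly one of failed/blocked is non-zero.
    return "partial"
```

Ordering the checks this way means each later branch can rely on what the earlier ones ruled out, so the `stuck` test does not need to re-check `done > 0`.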
Decision Tree
flowchart TD
    A[counts: done, failed, blocked + flake_retries] --> B{failed==0 and blocked==0?}
    B -->|yes| C{flake_retries > 0?}
    C -->|no| C1[clean]
    C -->|yes| C2[clean_with_flake]
    B -->|no| D{done==0?}
    D -->|yes| D1[catastrophic]
    D -->|no| E{done>0 and failed>0 and blocked>0?}
    E -->|yes| E1[stuck]
    E -->|no| E2[partial]
Outcome Table
| Outcome | done | failed | blocked | exit | meaning |
|---|---|---|---|---|---|
| clean | n | 0 | 0 | 0 | Every node converged on first attempt |
| clean_with_flake | n | 0 | 0 | 0 | Every node converged, ≥1 needed a retry |
| partial | >0 | ≥0 | ≥0 | 1 | Some nodes done, some unfinished; exactly one of failed/blocked is non-zero |
| stuck | >0 | >0 | >0 | 1 | All three counts non-zero, a true mess |
| catastrophic | 0 | n | n | 1 | Nothing made it across; at least one of failed/blocked is non-zero |
Process Exit Code
    exit_code = 0 if outcome in ("clean", "clean_with_flake") else 1

run_end.exit_code follows the same rule. The schema elides 0, so the field is only emitted when the exit code is 1 (partial, stuck, or catastrophic).
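The exit-code rule and the elision of a zero `exit_code` can be sketched together. This is an illustration under assumptions: `build_run_end` and the `"event"` and `"outcome"` keys are hypothetical; only the `run_end` event name, the `exit_code` field, and the 0/1 rule come from the text above.

```python
from typing import Any, Dict

# Outcomes that count as success for the process exit code.
SUCCESS_OUTCOMES = ("clean", "clean_with_flake")


def exit_code_for(outcome: str) -> int:
    """Return 0 for clean outcomes and 1 for everything else."""
    return 0 if outcome in SUCCESS_OUTCOMES else 1


def build_run_end(outcome: str) -> Dict[str, Any]:
    """Build a run_end payload (hypothetical shape), omitting exit_code when 0."""
    event: Dict[str, Any] = {"event": "run_end", "outcome": outcome}
    code = exit_code_for(outcome)
    if code != 0:
        # The schema elides 0, so only non-zero codes are written.
        event["exit_code"] = code
    return event
```

A consumer reading such events would treat a missing `exit_code` as 0, for example with `event.get("exit_code", 0)`.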
Key Terms
- flake_retries → Sum across nodes of attempts that succeeded after at least one prior failure.
- stuck → The “all three non-zero” outcome — both progress and pathology in one run.
- catastrophic → No nodes converged; usually a misconfigured root.
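One possible reading of the flake_retries definition, as a rough sketch. The input shape (a mapping from node id to an ordered list of attempt results) and the name `count_flake_retries` are assumptions made for illustration; the real bookkeeping may differ.

```python
from typing import Dict, List


def count_flake_retries(attempts: Dict[str, List[bool]]) -> int:
    """Count successful attempts that were preceded by a failed attempt.

    `attempts` maps a node id to its attempt results in order
    (True = success). This shape is hypothetical.
    """
    total = 0
    for results in attempts.values():
        for i, ok in enumerate(results):
            # A success counts only if some earlier attempt on this node failed.
            if ok and not all(results[:i]):
                total += 1
    return total


example = {
    "build": [True],          # first-attempt success: not a flake
    "test": [False, True],    # failed once, then succeeded: one flake retry
    "deploy": [False, False], # never succeeded: a failure, not a flake
}
print(count_flake_retries(example))  # 1
```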
Q&A
Q: Can partial have failed == 0?
A: Yes — if done > 0 and blocked > 0 but no node failed (e.g. an upstream timeout left descendants unable to start), the run is partial, not stuck.
Q: Is clean_with_flake a successful run?
A: Yes. Process exits 0; History UI renders it as a green pill with a flake glyph. Flake bookkeeping is a diagnostic signal, not a failure.
Q: Why does stuck require all three of done/failed/blocked > 0?
A: It’s the “mixed reality” case — work happened, work broke, work was blocked. Distinct from partial (only two of three) so diagnostics can target it specifically.
Examples
Five-node run: done=4, failed=1, blocked=0, flake_retries=2. The check at B fails (failed≠0), so flake_retries no longer affects the outcome. D evaluates done==0? No. E evaluates done>0 ∧ failed>0 ∧ blocked>0? blocked is 0, so the run falls through to partial. Exit code 1.
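The walk through the decision tree can be reproduced with a small tracer that records which branch was taken at each node. `trace_outcome` is a hypothetical helper written for this example; the branch labels B, C, D, and E follow the flowchart.

```python
from typing import List, Tuple


def trace_outcome(
    done: int, failed: int, blocked: int, flake_retries: int
) -> Tuple[List[str], str]:
    """Return the branches taken (by flowchart label) and the outcome."""
    path: List[str] = []
    if failed == 0 and blocked == 0:
        path.append("B: yes")
        if flake_retries > 0:
            path.append("C: yes")
            return path, "clean_with_flake"
        path.append("C: no")
        return path, "clean"
    path.append("B: no")
    if done == 0:
        path.append("D: yes")
        return path, "catastrophic"
    path.append("D: no")
    if done > 0 and failed > 0 and blocked > 0:
        path.append("E: yes")
        return path, "stuck"
    path.append("E: no")
    return path, "partial"


path, outcome = trace_outcome(done=4, failed=1, blocked=0, flake_retries=2)
print(path, outcome)  # ['B: no', 'D: no', 'E: no'] partial
```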
neighbors on the map
- CI Transition Event Schema: vendoring kahn_emit.py into a CI producer