CRUMB a card from devarno-cloud

eva eval Cases, Assertions & Judges

eva intermediate 6 min read

ELI5

eva eval is the prompt’s school exam. Each case is one question; some questions are graded by a strict checklist (assert:), others by a teacher’s rubric (rubric:), and a single failed line drops the whole exam.

Technical Deep Dive

Case Shape (bin/eva:386-426, 714-827)

defaults:
  model: claude-opus-4-7
  timeout_s: 60
cases:
  - name: small-file-happy-path
    inputs_from: examples/happy-path   # or an inline `inputs:` mapping
    assert:
      - contains: "<diff>"
      - max_tokens: 2000
    rubric: "Score the diff for surgical-ness, 1-5."
    judge: { model: claude-haiku-4-5-20251001, pass_threshold: 4 }

  • Each case must have a name and at least one of inputs: or inputs_from:.
  • Each case must have at least one of assert: or rubric:.
  • inputs_from is resolved against ROOT first, then against the prompt directory (bin/eva:600-610).
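
The two "must have" rules above can be sketched as a small validator. This is an illustration only, not the code in bin/eva: the function name validate_case and the wording of the problem messages are hypothetical.

```python
def validate_case(case: dict) -> list:
    """Return a list of problems; an empty list means the case is well-formed.

    Hypothetical helper: mirrors the two shape rules listed above.
    """
    problems = []
    if not case.get("name"):
        problems.append("missing name")
    if "inputs" not in case and "inputs_from" not in case:
        problems.append("needs inputs: or inputs_from:")
    if "assert" not in case and "rubric" not in case:
        problems.append("needs assert: or rubric:")
    return problems


good = {
    "name": "small-file-happy-path",
    "inputs_from": "examples/happy-path",
    "assert": [{"contains": "<diff>"}],
}
bad = {"name": "no-checks", "inputs": {"constraints": ""}}

print(validate_case(good))  # []
print(validate_case(bad))   # ['needs assert: or rubric:']
```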

Assertion Ops (bin/eva:643-665)

Op                            Semantics
contains / not_contains       substring presence/absence
contains_any / contains_all   list-form variants
matches / not_matches         re.search
min_tokens / max_tokens       _approx_tokens = len(text.split())
json_schema                   reserved; raises not implemented in v1
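
A minimal sketch of how these ops could be evaluated, assuming each assertion is a single op/value pair. The function names check and approx_tokens are hypothetical stand-ins, the inclusive token bounds are an assumption, and so is the use of NotImplementedError for json_schema (the card only says it "raises not implemented").

```python
import re


def approx_tokens(text: str) -> int:
    # Whitespace word count, as in the table above; not a real tokenizer.
    return len(text.split())


def check(op: str, expected, text: str) -> bool:
    """Evaluate one assertion op against the model output `text`."""
    if op == "contains":
        return expected in text
    if op == "not_contains":
        return expected not in text
    if op == "contains_any":  # assumption: `expected` is a list of substrings
        return any(s in text for s in expected)
    if op == "contains_all":
        return all(s in text for s in expected)
    if op == "matches":
        return re.search(expected, text) is not None
    if op == "not_matches":
        return re.search(expected, text) is None
    if op == "min_tokens":  # assumption: bounds are inclusive
        return approx_tokens(text) >= expected
    if op == "max_tokens":
        return approx_tokens(text) <= expected
    if op == "json_schema":
        raise NotImplementedError("json_schema is reserved in v1")
    raise ValueError(f"unknown assertion op: {op}")


output = "<diff>\n- old line\n+ new line\n</diff>"
print(check("contains", "<diff>", output))            # True
print(check("max_tokens", 2000, output))              # True
print(check("contains_all", ["- old", "+ new"], output))  # True
print(check("not_matches", r"TODO", output))          # True
```

Because the token count is a whitespace split, max_tokens: 2000 bounds the number of words, not model tokens.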

Judge Loop

flowchart TD
start["eva eval id"] --> trig{triggering block?}
trig -- yes --> tj["_trigger_judge per query"]
trig -- no --> casesL
tj --> casesL["for each case"]
casesL --> render["render(prompt.xml, inputs)"]
render --> claude["_claude(rendered, model, timeout)"]
claude --> a{assert present?}
a -- yes --> ck["_check_assertions"]
a -- no --> r{rubric present?}
ck --> r
r -- yes --> jdg["_judge → SCORE=n REASON=…"]
jdg --> th{"score ≥ pass_threshold (default 4)?"}
th -- no --> fail["case_pass=false"]
th -- yes --> ok
ck --> ok["case_pass=true"]
r -- no --> ok
fail --> agg
ok --> agg["aggregate; append .eval.jsonl"]
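
The per-case verdict in the flowchart can be condensed into one function. This is a hedged sketch, not the harness code: case_passes is a hypothetical name, and it assumes that any failed assertion fails the case regardless of the rubric (the ELI5 suggests this, but the diagram does not draw that edge).

```python
def case_passes(assertion_results, has_rubric, judge_score, pass_threshold=4):
    """Combine assertion results and an optional judge score into a verdict.

    assertion_results: list of bools, empty when the case has no assert: block.
    has_rubric:        True when the case defines rubric:.
    judge_score:       parsed integer score, or None if the judge reply
                       could not be parsed.
    """
    if not all(assertion_results):
        return False  # assumption: one failed assertion fails the case
    if has_rubric:
        # An unparseable judge reply (None) counts as a failure.
        return judge_score is not None and judge_score >= pass_threshold
    return True


print(case_passes([True, True], False, None))  # True: assertions only
print(case_passes([], True, 3))                # False: below the default threshold
print(case_passes([True], True, 4))            # True: score meets the threshold
```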

Side Effects

Every invocation appends one row to .eval.jsonl (bin/eva:820-822) with {ts, all_passed, total, passed, failed_cases}. cmd_promote reads the most recent row to gate tested → ready (eva-003).
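
The append-then-read-latest pattern can be sketched as follows. The field names come from the row described above; everything else is an assumption for illustration: the helper names append_row, latest_row and can_promote, the epoch-seconds timestamp, and the rule that promotion requires all_passed to be true.

```python
import json
import tempfile
import time
from pathlib import Path


def append_row(log_path: Path, results: dict) -> dict:
    """Append one summary row for a run; `results` maps case name -> passed."""
    failed = [name for name, ok in results.items() if not ok]
    row = {
        "ts": time.time(),  # assumption: the real timestamp format is not stated
        "all_passed": not failed,
        "total": len(results),
        "passed": len(results) - len(failed),
        "failed_cases": failed,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
    return row


def latest_row(log_path: Path):
    """Return the most recent row, or None if the log is missing or empty."""
    if not log_path.exists():
        return None
    lines = [ln for ln in log_path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    return json.loads(lines[-1]) if lines else None


def can_promote(log_path: Path) -> bool:
    # Assumed gate: promote only when the latest run passed every case.
    row = latest_row(log_path)
    return bool(row and row["all_passed"])


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / ".eval.jsonl"
    append_row(log, {"small-file-happy-path": True, "rejects-empty": False})
    print(can_promote(log))  # False: the latest run has a failed case
    append_row(log, {"small-file-happy-path": True, "rejects-empty": True})
    print(can_promote(log))  # True: only the most recent row is consulted
```

Reading only the last row means an older green run cannot mask a newer red one.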

Key Terms

  • assert — deterministic checks the harness applies to the model output.
  • rubric — natural-language criterion sent to a separate judge model along with the output.
  • pass_threshold — minimum integer score (1–5) the judge must return; default 4 (bin/eva:793).
  • inputs_from — relative directory whose inputs.yml provides the case’s variables.

Q&A

Q: Which assertion ops are recognised by _apply_assertion? A: contains, not_contains, contains_any, contains_all, matches, not_matches, max_tokens, min_tokens. json_schema is reserved and currently fails the case (bin/eva:643-665).

Q: What does the rubric judge return and how is the threshold applied? A: It must reply with SCORE=<int> REASON=<sentence>; the case passes when score ≥ pass_threshold (default 4). Unparseable output fails the case (bin/eva:696-711, 791-802).
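
A minimal sketch of parsing that reply and applying the threshold. The regex and the names parse_judge_reply and rubric_passes are hypothetical; the actual parser in bin/eva may be stricter or looser about whitespace and line breaks.

```python
import re

# Accept SCORE and REASON on one line or on separate lines.
_SCORE_RE = re.compile(r"SCORE=(\d+)\s+REASON=(.*)", re.DOTALL)


def parse_judge_reply(reply: str):
    """Return (score, reason), or None when the reply does not match."""
    m = _SCORE_RE.search(reply)
    if m is None:
        return None
    return int(m.group(1)), m.group(2).strip()


def rubric_passes(reply: str, pass_threshold: int = 4) -> bool:
    parsed = parse_judge_reply(reply)
    if parsed is None:
        return False  # unparseable output fails the case
    score, _reason = parsed
    return score >= pass_threshold


print(parse_judge_reply("SCORE=5 REASON=Minimal, targeted diff."))
# (5, 'Minimal, targeted diff.')
print(rubric_passes("SCORE=3 REASON=Touches unrelated lines."))  # False
print(rubric_passes("Looks good to me."))                        # False
```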

Q: Where does inputs_from resolve from? A: First <repo_root>/<inputs_from>/inputs.yml, then <prompt_dir>/<inputs_from>/inputs.yml (bin/eva:603-609). Doctor accepts either path.
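
The two-step lookup can be sketched like this. The function name resolve_inputs_from and the choice to return None when neither file exists are assumptions; the card does not say how a miss is reported.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_inputs_from(inputs_from: str, repo_root: Path, prompt_dir: Path) -> Optional[Path]:
    """Return the first existing inputs.yml, checking repo root before prompt dir."""
    for base in (repo_root, prompt_dir):
        candidate = base / inputs_from / "inputs.yml"
        if candidate.is_file():
            return candidate
    return None  # assumption: the real harness may raise instead


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    prompt_dir = root / "prompts" / "demo"  # hypothetical layout
    fixture = prompt_dir / "examples" / "happy-path"
    fixture.mkdir(parents=True)
    (fixture / "inputs.yml").write_text("constraints: ''\n", encoding="utf-8")

    found = resolve_inputs_from("examples/happy-path", root, prompt_dir)
    print(found == fixture / "inputs.yml")  # True: falls back to the prompt directory
```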

Examples

Inline-inputs case with a single assertion:

cases:
  - name: rejects-empty
    inputs: {constraints: ""}
    assert:
      - contains: "constraints empty"
