eva eval Cases, Assertions & Judges
eva intermediate 6 min read
ELI5
eva eval is the prompt’s school exam. Each case is one question; some questions are graded by a strict checklist (assert:), others by a teacher’s rubric (rubric:), and a single failed line drops the whole exam.
Technical Deep Dive
Case Shape (bin/eva:386-426, 714-827)

    defaults:
      model: claude-opus-4-7
      timeout_s: 60
    cases:
      - name: small-file-happy-path
        inputs_from: examples/happy-path  # OR inline `inputs:` mapping
        assert:
          - contains: "<diff>"
          - max_tokens: 2000
        rubric: "Score the diff for surgical-ness, 1-5."
        judge: { model: claude-haiku-4-5-20251001, pass_threshold: 4 }

- Each case must have a `name` and at least one of `inputs:` or `inputs_from:`.
- Each case must have at least one of `assert:` or `rubric:`.
- `inputs_from` is resolved against `ROOT` first, then the prompt directory (bin/eva:600-610).
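
The shape rules above can be checked mechanically. The following is a minimal sketch, assuming a case has already been loaded into a plain `dict`; `validate_case` is a hypothetical helper written for illustration, not a function taken from `bin/eva`.

```python
def validate_case(case: dict) -> list:
    """Return a list of shape problems; an empty list means the case is well-formed."""
    problems = []
    # Rule 1: every case needs a name.
    if not case.get("name"):
        problems.append("missing name")
    # Rule 2: at least one source of inputs must be declared.
    if "inputs" not in case and "inputs_from" not in case:
        problems.append("needs inputs: or inputs_from:")
    # Rule 3: at least one grading mechanism must be declared.
    if "assert" not in case and "rubric" not in case:
        problems.append("needs assert: or rubric:")
    return problems


good = {
    "name": "small-file-happy-path",
    "inputs_from": "examples/happy-path",
    "assert": [{"contains": "<diff>"}],
}
bad = {"name": "no-checks", "inputs": {"constraints": ""}}

print(validate_case(good))  # []
print(validate_case(bad))   # ['needs assert: or rubric:']
```

Returning a list of problems rather than raising on the first one is a design choice of this sketch; it lets a caller report every shape error in a file at once.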
Assertion Ops (bin/eva:643-665)
| Op | Semantics |
|---|---|
| `contains` / `not_contains` | substring presence/absence |
| `contains_any` / `contains_all` | list-form variants |
| `matches` / `not_matches` | `re.search` |
| `min_tokens` / `max_tokens` | `_approx_tokens = len(text.split())` |
| `json_schema` | reserved; raises not implemented in v1 |
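
The semantics in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `bin/eva`: the names `apply_assertion` and `check_assertions` are hypothetical stand-ins, the token bounds are assumed to be inclusive, and `contains_any` / `contains_all` are assumed to take a list of substrings.

```python
import re


def approx_tokens(text: str) -> int:
    # Whitespace word count, mirroring the table's len(text.split()).
    return len(text.split())


def apply_assertion(op: str, arg, text: str) -> bool:
    """Evaluate one assertion op against the model output."""
    if op == "contains":
        return arg in text
    if op == "not_contains":
        return arg not in text
    if op == "contains_any":
        return any(s in text for s in arg)
    if op == "contains_all":
        return all(s in text for s in arg)
    if op == "matches":
        return re.search(arg, text) is not None
    if op == "not_matches":
        return re.search(arg, text) is None
    if op == "min_tokens":
        return approx_tokens(text) >= arg  # inclusive bound (assumption)
    if op == "max_tokens":
        return approx_tokens(text) <= arg  # inclusive bound (assumption)
    if op == "json_schema":
        raise NotImplementedError("json_schema is reserved")
    raise ValueError(f"unknown assertion op: {op}")


def check_assertions(assertions: list, text: str) -> list:
    """Return the failed assertions as strings; an empty list means all passed."""
    failed = []
    for item in assertions:  # each item is a one-key mapping, e.g. {"contains": "<diff>"}
        for op, arg in item.items():
            if not apply_assertion(op, arg, text):
                failed.append(f"{op}: {arg!r}")
    return failed


output = "<diff>\n- old line\n+ new line\n</diff>"
print(check_assertions([{"contains": "<diff>"}, {"max_tokens": 2000}], output))  # []
print(check_assertions([{"max_tokens": 5}], output))  # ['max_tokens: 5']
```

Because the token count is a whitespace split, `max_tokens: 2000` bounds words rather than model tokens, so budgets set this way are only approximate.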
Judge Loop

    flowchart TD
      start["eva eval id"] --> trig{triggering block?}
      trig -- yes --> tj["_trigger_judge per query"]
      trig -- no --> casesL
      tj --> casesL["for each case"]
      casesL --> render["render(prompt.xml, inputs)"]
      render --> claude["_claude(rendered, model, timeout)"]
      claude --> a{assert present?}
      a -- yes --> ck["_check_assertions"]
      a -- no --> r{rubric present?}
      ck --> r
      r -- yes --> jdg["_judge → SCORE=n REASON=…"]
      jdg --> th{"score ≥ pass_threshold (default 4)?"}
      th -- no --> fail["case_pass=false"]
      th -- yes --> ok
      ck --> ok["case_pass=true"]
      r -- no --> ok
      fail --> agg
      ok --> agg["aggregate; append .eval.jsonl"]

Side Effects
Every invocation appends one row to .eval.jsonl (bin/eva:820-822) with {ts, all_passed, total, passed, failed_cases}. cmd_promote reads the most recent row to gate tested → ready (eva-003).
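
The append-then-read-latest pattern described above can be sketched as follows. The row keys come from the text; everything else is an assumption made for illustration: the helper names `append_eval_row` and `latest_eval_row` are hypothetical, and the `ts` value is shown as a Unix timestamp because the source does not state its format.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


def append_eval_row(log_path: Path, results: dict) -> dict:
    """Append one summary row for a run; `results` maps case name -> passed?"""
    failed = [name for name, ok in results.items() if not ok]
    row = {
        "ts": time.time(),  # timestamp format is an assumption
        "all_passed": not failed,
        "total": len(results),
        "passed": len(results) - len(failed),
        "failed_cases": failed,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")  # one JSON object per line
    return row


def latest_eval_row(log_path: Path) -> Optional[dict]:
    """Return the most recent row, or None if nothing has been logged yet."""
    if not log_path.exists():
        return None
    lines = [ln for ln in log_path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    return json.loads(lines[-1]) if lines else None


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / ".eval.jsonl"
    append_eval_row(log, {"rejects-empty": True, "small-file-happy-path": False})
    append_eval_row(log, {"rejects-empty": True, "small-file-happy-path": True})
    latest = latest_eval_row(log)
    print(latest["all_passed"], latest["passed"], latest["total"])  # True 2 2
```

Since only the last row is consulted, a promotion gate built this way reflects the most recent run; an earlier failing run does not block it once a later run passes.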
Key Terms
- `assert`: deterministic checks the harness applies to the model output.
- `rubric`: natural-language criterion sent to a separate judge model along with the output.
- `pass_threshold`: minimum integer score (1–5) the judge must return; default 4 (bin/eva:793).
- `inputs_from`: relative directory whose `inputs.yml` provides the case’s variables.
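
To show how `rubric` and `pass_threshold` interact, here is a minimal sketch of parsing a judge reply in the `SCORE=<int> REASON=<sentence>` form and applying the threshold. The regular expressions and the helper names `parse_judge_reply` and `rubric_passes` are assumptions for illustration; the actual parsing in `bin/eva` may differ, and this sketch does not enforce the 1–5 range.

```python
import re

SCORE_RE = re.compile(r"SCORE\s*=\s*(\d+)")
REASON_RE = re.compile(r"REASON\s*=\s*(.+)", re.DOTALL)


def parse_judge_reply(reply: str):
    """Return (score, reason), or None when no SCORE=<int> can be found."""
    score_match = SCORE_RE.search(reply)
    if score_match is None:
        return None
    reason_match = REASON_RE.search(reply)
    reason = reason_match.group(1).strip() if reason_match else ""
    return int(score_match.group(1)), reason


def rubric_passes(reply: str, pass_threshold: int = 4) -> bool:
    """A case passes when score >= pass_threshold; unparseable replies fail."""
    parsed = parse_judge_reply(reply)
    if parsed is None:
        return False
    score, _reason = parsed
    return score >= pass_threshold


print(rubric_passes("SCORE=4 REASON=Minimal, targeted diff."))  # True
print(rubric_passes("SCORE=3 REASON=Touches unrelated lines."))  # False
print(rubric_passes("Looks fine to me."))  # False
```

Treating an unparseable reply as a failure keeps the gate conservative: a judge that ignores the output format cannot pass a case by accident.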
Q&A
Q: Which assertion ops are recognised by _apply_assertion?
A: contains, not_contains, contains_any, contains_all, matches, not_matches, max_tokens, min_tokens. json_schema is reserved and currently fails the case (bin/eva:643-665).
Q: What does the rubric judge return and how is the threshold applied?
A: It must reply with SCORE=<int> REASON=<sentence>; the case passes when score ≥ pass_threshold (default 4). Unparseable output fails the case (bin/eva:696-711, 791-802).
Q: In what order are the candidate locations for `inputs_from` tried?
A: First <repo_root>/<inputs_from>/inputs.yml, then <prompt_dir>/<inputs_from>/inputs.yml (bin/eva:603-609). Doctor accepts either path.
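
The two-step lookup from the last answer can be sketched with `pathlib`. This is an illustration only: `resolve_inputs_from` and its parameters are hypothetical names, and returning `None` when neither file exists is an assumption about the failure mode.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_inputs_from(inputs_from: str, repo_root: Path, prompt_dir: Path) -> Optional[Path]:
    """Try <repo_root>/<inputs_from>/inputs.yml first, then the prompt directory."""
    for base in (repo_root, prompt_dir):  # order matters: repo root wins
        candidate = base / inputs_from / "inputs.yml"
        if candidate.is_file():
            return candidate
    return None  # assumption: the caller decides how to report a miss


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    prompt_dir = root / "prompts" / "demo"
    # Only the prompt-directory copy exists at first.
    local = prompt_dir / "examples" / "happy-path"
    local.mkdir(parents=True)
    (local / "inputs.yml").write_text("constraints: ''\n", encoding="utf-8")
    found = resolve_inputs_from("examples/happy-path", root, prompt_dir)
    print(found == local / "inputs.yml")  # True
```

If both locations contain an `inputs.yml`, the repo-root copy is the one returned, which follows the order given in the answer above.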
Examples
Inline-inputs case with a single assertion:

    cases:
      - name: rejects-empty
        inputs: {constraints: ""}
        assert:
          - contains: "constraints empty"

Neighbors on the Map
- Triggering Tests for Skill Auto-Load: tuning meta.description so the right queries auto-load the skill
- Promotion Lifecycle Gates: promoting a prompt from draft to tested
- JSON Schema 2020-12 & Validation Pipeline: validating sprite/council/chain definitions
- Operations & Versions Schema: writing a new sync query