eva eval Cases, Assertions & Judges


ELI5

eva eval is the prompt’s school exam. Each case is one question; some questions are graded by a strict checklist (assert:), others by a teacher’s rubric (rubric:). A single failed check fails its question, and one failed question fails the whole exam.

Technical Deep Dive

Case Shape (`bin/eva:386-426`, `714-827`)

defaults:
  model: claude-opus-4-7
  timeout_s: 60
cases:
  - name: small-file-happy-path
    inputs_from: examples/happy-path     # OR inline `inputs:` mapping
    assert:
      - contains: "<diff>"
      - max_tokens: 2000
    rubric: "Score the diff for surgical-ness, 1-5."
    judge: { model: claude-haiku-4-5-20251001, pass_threshold: 4 }

Each case must have a name and at least one of inputs: or inputs_from:.
Each case must have at least one of assert: or rubric:.
inputs_from names a directory containing inputs.yml; it is resolved against ROOT first, then against the prompt directory (bin/eva:600-610).
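
The shape rules and the lookup order above can be sketched in a few lines. This is a minimal illustration, not the code in bin/eva: the function names `validate_case` and `resolve_inputs_from` and the wording of the problem messages are hypothetical, and the case is assumed to be an already-parsed YAML mapping.

```python
from pathlib import Path
from typing import List, Optional


def validate_case(case: dict) -> List[str]:
    """Return a list of problems; an empty list means the case shape is valid."""
    problems = []
    if not case.get("name"):
        problems.append("missing name")
    if "inputs" not in case and "inputs_from" not in case:
        problems.append("needs inputs: or inputs_from:")
    if "assert" not in case and "rubric" not in case:
        problems.append("needs assert: or rubric:")
    return problems


def resolve_inputs_from(inputs_from: str, root: Path, prompt_dir: Path) -> Optional[Path]:
    """Look for inputs.yml under the repo root first, then under the prompt directory."""
    for base in (root, prompt_dir):
        candidate = base / inputs_from / "inputs.yml"
        if candidate.is_file():
            return candidate
    # Hypothetical choice: signal "not found" with None rather than raising.
    return None
```

Because the repo root is checked first, a directory with the same relative name under ROOT shadows one under the prompt directory.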

Assertion Ops (`bin/eva:643-665`)

Op	Semantics
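`contains` / `not_contains`	substring is present / absent in the output
`contains_any` / `contains_all`	list forms: at least one / every listed substring is present
`matches` / `not_matches`	regex found / not found via `re.search` (matches anywhere in the output, not anchored)
`min_tokens` / `max_tokens`	bounds on an approximate token count, `_approx_tokens = len(text.split())` (whitespace-separated words)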
`json_schema`	reserved; raises `not implemented in v1`
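
To make the table concrete, here is a rough sketch of a dispatcher over these ops. It is an illustration under assumptions, not the body of `_apply_assertion`: the names `apply_assertion` and `approx_tokens` are hypothetical, the token bounds are assumed to be inclusive, and the exception types are a guess (the source only states the message `not implemented in v1`).

```python
import re


def approx_tokens(text: str) -> int:
    # Whitespace word count, mirroring the table's len(text.split()).
    return len(text.split())


def apply_assertion(op: str, expected, text: str) -> bool:
    """Return True when `text` satisfies a single {op: expected} assertion."""
    if op == "contains":
        return expected in text
    if op == "not_contains":
        return expected not in text
    if op == "contains_any":
        return any(s in text for s in expected)
    if op == "contains_all":
        return all(s in text for s in expected)
    if op == "matches":
        return re.search(expected, text) is not None
    if op == "not_matches":
        return re.search(expected, text) is None
    if op == "min_tokens":
        return approx_tokens(text) >= expected  # assumption: inclusive bound
    if op == "max_tokens":
        return approx_tokens(text) <= expected  # assumption: inclusive bound
    if op == "json_schema":
        # Assumption: exception type chosen for illustration only.
        raise NotImplementedError("not implemented in v1")
    raise ValueError(f"unknown assertion op: {op}")
```

Note that a whitespace word count is only a proxy for model tokens, so a `max_tokens: 2000` assertion bounds words rather than tokenizer units.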

Judge Loop

flowchart TD
  start["eva eval id"] --> trig{triggering block?}
  trig -- yes --> tj["_trigger_judge per query"]
  trig -- no --> casesL
  tj --> casesL["for each case"]
  casesL --> render["render(prompt.xml, inputs)"]
  render --> claude["_claude(rendered, model, timeout)"]
  claude --> a{assert present?}
  a -- yes --> ck["_check_assertions"]
  a -- no --> r{rubric present?}
  ck --> r
  r -- yes --> jdg["_judge → SCORE=n REASON=…"]
  jdg --> th{"score ≥ pass_threshold (default 4)?"}
  th -- no --> fail["case_pass=false"]
  th -- yes --> ok
  ck --> ok["case_pass=true"]
  r -- no --> ok
  fail --> agg
  ok --> agg["aggregate; append .eval.jsonl"]

Side Effects

Every invocation appends one row to .eval.jsonl (bin/eva:820-822) with {ts, all_passed, total, passed, failed_cases}. cmd_promote reads the most recent row to gate tested → ready (eva-003).
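
The append-then-read-latest pattern described above might look roughly like the following. This is a sketch under assumptions: `append_eval_row` and `latest_eval_passed` are hypothetical names, the `ts` format is not stated in the source (a Unix timestamp is used here), `passed` is assumed to be `total` minus the number of failed cases, and treating a missing or empty log as "not passed" is an illustrative choice rather than documented behaviour.

```python
import json
import time
from pathlib import Path
from typing import List


def append_eval_row(log_path: Path, total: int, failed_cases: List[str]) -> dict:
    """Append one summary row for this invocation and return it."""
    row = {
        "ts": time.time(),  # assumption: real timestamp format is not stated
        "all_passed": not failed_cases,
        "total": total,
        "passed": total - len(failed_cases),
        "failed_cases": failed_cases,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
    return row


def latest_eval_passed(log_path: Path) -> bool:
    """Gate check: True only if the most recent row exists and reports all_passed."""
    if not log_path.is_file():
        return False
    lines = [ln for ln in log_path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    return bool(lines) and bool(json.loads(lines[-1]).get("all_passed"))
```

Since only the last row counts, a failing run after a passing one closes the gate again until the next green run.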

Key Terms

assert — deterministic checks the harness applies to the model output.
rubric — natural-language criterion sent to a separate judge model along with the output.
pass_threshold — minimum integer score (1–5) the judge must return; default 4 (bin/eva:793).
inputs_from — relative directory whose inputs.yml provides the case’s variables.

Q&A

Q: Which assertion ops are recognised by _apply_assertion? A: contains, not_contains, contains_any, contains_all, matches, not_matches, max_tokens, min_tokens. json_schema is reserved and currently fails the case (bin/eva:643-665).

Q: What does the rubric judge return and how is the threshold applied? A: It must reply with SCORE=<int> REASON=<sentence>; the case passes when score ≥ pass_threshold (default 4). Unparseable output fails the case (bin/eva:696-711, 791-802).
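
As an illustration of that answer, a judge reply could be parsed and thresholded along these lines. The names `parse_judge_reply` and `rubric_passes` are hypothetical, and the regular expression assumes SCORE and REASON are separated by whitespace (possibly a newline); the actual parser in bin/eva may be stricter or looser.

```python
import re
from typing import Optional, Tuple

# Assumption: SCORE and REASON are separated by any whitespace, including newlines.
_JUDGE_RE = re.compile(r"SCORE=(\d+)\s+REASON=(.*)", re.DOTALL)


def parse_judge_reply(reply: str) -> Optional[Tuple[int, str]]:
    """Return (score, reason), or None when the reply does not follow the format."""
    m = _JUDGE_RE.search(reply)
    if m is None:
        return None
    return int(m.group(1)), m.group(2).strip()


def rubric_passes(reply: str, pass_threshold: int = 4) -> bool:
    """Unparseable replies fail; otherwise compare the score with the threshold."""
    parsed = parse_judge_reply(reply)
    if parsed is None:
        return False
    score, _reason = parsed
    return score >= pass_threshold
```

With the default threshold, `SCORE=4 REASON=ok` passes and `SCORE=3 REASON=too broad` fails; a reply with no SCORE line fails regardless of its content.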

Q: Where is a case's inputs_from resolved from? A: First <repo_root>/<inputs_from>/inputs.yml, then <prompt_dir>/<inputs_from>/inputs.yml (bin/eva:603-609). Doctor accepts either path.

Examples

Inline-inputs case with a single assertion:

cases:
  - name: rejects-empty
    inputs: {constraints: ""}
    assert:
      - contains: "constraints empty"
