CRUMB a card from devarno-cloud

Triggering Tests for Skill Auto-Load

eva intermediate 4 min read

ELI5

A separate small judge reads the prompt’s description card — not the prompt itself — and decides whether a sample user request would trigger it. You list the requests you want it to grab (should_match) and the ones you want it to ignore (should_not_match); a YES/NO that disagrees fails the eval.

Technical Deep Dive

Block Shape (bin/eva:386-405, 734-760)

triggering:
  judge:
    model: claude-haiku-4-5-20251001   # falls back to defaults.model
  should_match:
    - "refactor src/foo.py to remove the global state"
  should_not_match:
    - "explain what this file does"

eva doctor requires the block to be a mapping that contains at least one of should_match / should_not_match; each list that is present must contain only non-empty strings.
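A minimal Python sketch of those doctor checks, assuming the block has already been loaded from YAML into plain Python objects. The function name check_triggering_block and its error messages are hypothetical; bin/eva's actual validation may word or order things differently.

```python
from typing import Any


def check_triggering_block(block: Any) -> list[str]:
    """Return problems found in a `triggering:` block (empty list means OK).

    Hypothetical re-implementation of the doctor rules; not bin/eva's code.
    """
    if not isinstance(block, dict):
        return ["triggering must be a mapping"]
    errors: list[str] = []
    present = [k for k in ("should_match", "should_not_match") if k in block]
    if not present:
        errors.append("triggering needs should_match and/or should_not_match")
    for key in present:
        items = block[key]
        if not isinstance(items, list):
            errors.append(f"{key} must be a list")
            continue
        # Every entry must be a string with at least one non-space character.
        for i, item in enumerate(items):
            if not isinstance(item, str) or not item.strip():
                errors.append(f"{key}[{i}] must be a non-empty string")
    return errors
```

A block with only a judge mapping fails, while a block with a single should_match string passes.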

Judge Prompt Construction (bin/eva:668-693)

The judge receives a fixed wrapper plus four sections, three composed from meta.yml and one holding the test query:

  • DESCRIPTION: meta.description (or meta.summary as a fallback).
  • POSITIVE TRIGGERS: meta.triggers, as a bulleted list.
  • NEGATIVE TRIGGERS (do NOT use for): meta.not_for, as a bulleted list.
  • USER QUERY: the test string.
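The composition above can be sketched in Python as follows. Only the section labels and the description/summary fallback come from this card; the wrapper wording and the function name build_judge_prompt are assumptions for illustration.

```python
def build_judge_prompt(meta: dict, query: str) -> str:
    """Compose a judge prompt from meta.yml fields and one test query.

    Hypothetical sketch: the wrapper text below is a stand-in, not bin/eva's.
    """
    description = meta.get("description") or meta.get("summary") or ""
    # `or []` also covers keys present in YAML but set to null.
    triggers = "\n".join(f"- {t}" for t in meta.get("triggers") or [])
    not_for = "\n".join(f"- {t}" for t in meta.get("not_for") or [])
    wrapper = (
        "Decide whether this skill would auto-load for the user query, "
        "based on the description and triggers ONLY.\n"
        "Reply exactly: DECISION=<YES|NO> REASON=<one short sentence>\n"
    )
    return (
        f"{wrapper}\n"
        f"DESCRIPTION: {description}\n\n"
        f"POSITIVE TRIGGERS:\n{triggers}\n\n"
        f"NEGATIVE TRIGGERS (do NOT use for):\n{not_for}\n\n"
        f"USER QUERY: {query}\n"
    )
```

The prompt never includes the skill body itself, which is what keeps the test focused on description shape.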

The judge must reply with a single line, DECISION=<YES|NO> REASON=<one short sentence>. Any other output returns (None, "judge output unparseable: …") and counts as a failure.
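A sketch of the parsing step in Python. The DECISION regex is the one this card attributes to bin/eva; because it uses re.search, stray text around the token is tolerated. The REASON extraction, the truncation of the error text, and the name parse_judge_output are assumptions.

```python
import re
from typing import Optional, Tuple


def parse_judge_output(stdout: str) -> Tuple[Optional[bool], str]:
    """Map judge stdout to (decision, reason).

    decision is True for YES, False for NO, and None when no DECISION=
    token is found (the caller counts None as a failure).
    """
    match = re.search(r"DECISION=(YES|NO)", stdout)
    if match is None:
        return None, f"judge output unparseable: {stdout.strip()[:80]}"
    # Hypothetical: take the rest of the line after REASON= as the reason.
    reason_match = re.search(r"REASON=(.*)", stdout)
    reason = reason_match.group(1).strip() if reason_match else ""
    return match.group(1) == "YES", reason
```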

Sequence

sequenceDiagram
    participant E as eva eval
    participant M as meta.yml
    participant J as judge model
    E->>M: read description, triggers, not_for
    loop each should_match query
        E->>J: wrapper + DESCRIPTION + TRIGGERS + USER QUERY
        J-->>E: DECISION=YES|NO REASON=…
        E->>E: PASS iff YES
    end
    loop each should_not_match query
        E->>J: same wrapper, different query
        J-->>E: DECISION=YES|NO REASON=…
        E->>E: PASS iff NO
    end
    E->>E: include in passed/total + .eval.jsonl
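The two loops in the diagram reduce to comparing each decision with the expected answer. In this Python sketch the judge is any callable returning True (YES), False (NO) or None (unparseable); the record fields and the name run_triggering are hypothetical, and writing .eval.jsonl is left out.

```python
from typing import Callable, Optional


def run_triggering(
    should_match: list[str],
    should_not_match: list[str],
    judge: Callable[[str], Optional[bool]],
) -> list[dict]:
    """Score every query: should_match expects YES, should_not_match expects NO."""
    results = []
    for expected, queries in ((True, should_match), (False, should_not_match)):
        for query in queries:
            decision = judge(query)
            # None never equals True or False, so unparseable output fails.
            results.append({
                "query": query,
                "expected": expected,
                "decision": decision,
                "passed": decision == expected,
            })
    return results
```

With a stub judge such as `lambda q: "refactor" in q`, summing the `passed` fields gives the count that would feed passed/total.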

Skip Conditions

The triggering pass is skipped entirely when --case is set (single-case mode), and when the block is absent or both lists are empty (bin/eva:738).
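Those skip rules fit in one predicate. This Python sketch uses a hypothetical helper name, takes the --case value as an optional string, and treats a non-mapping block like an absent one, which is an assumption beyond what the card states.

```python
from typing import Any, Optional


def should_run_triggering(block: Any, case: Optional[str]) -> bool:
    """Hypothetical helper: True when the triggering pass should run."""
    if case:  # --case given: single-case mode skips triggering entirely
        return False
    if not isinstance(block, dict):  # absent block (None); non-mapping assumed absent
        return False
    # Skip when both lists are missing or empty.
    return bool(block.get("should_match") or block.get("should_not_match"))
```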

Key Terms

  • trigger judge — a secondary claude invocation that evaluates the description alone, not the prompt's output.
  • DECISION=YES — the judge predicts the skill auto-loads for this query; required for should_match, forbidden for should_not_match.
  • default judge model — resolved as triggering.judge.model → defaults.model → unset.
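The fallback chain for the judge model can be sketched in Python, assuming (hypothetically) that triggering and defaults sit as top-level keys of one loaded config dict; None stands for "unset".

```python
from typing import Optional


def resolve_judge_model(config: dict) -> Optional[str]:
    """triggering.judge.model, then defaults.model, else None (unset)."""
    # `or {}` guards against missing keys and YAML nulls at each level.
    judge = (config.get("triggering") or {}).get("judge") or {}
    defaults = config.get("defaults") or {}
    return judge.get("model") or defaults.get("model")
```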

Q&A

Q: Which two list keys does the triggering block accept? A: should_match and should_not_match; doctor requires at least one of them when the block is present (bin/eva:404-405).

Q: What format does the trigger judge return? A: A single line DECISION=<YES|NO> REASON=<sentence> parsed by re.search(r"DECISION=(YES|NO)", stdout) (bin/eva:689).

Q: What gets sent to the judge alongside the user query? A: The meta.description (or summary if missing), bulleted meta.triggers, bulleted meta.not_for, all wrapped in a fixed instruction telling the judge to decide based on description and triggers ONLY (bin/eva:668-685).

Examples

A negative test that checks the refactor prompt's description does not grab weather queries:

triggering:
  should_not_match:
    - "what's the weather in San Francisco"
