Triggering Tests for Skill Auto-Load

eva intermediate 4 min read

ELI5

A separate small judge reads the prompt's description card, not the prompt itself, and decides whether a sample user request would trigger it. You list the requests you want it to grab (should_match) and the ones you want it to ignore (should_not_match); a YES/NO verdict that disagrees with the list a request sits in counts as a failure in the eval.

Technical Deep Dive

Block Shape (`bin/eva:386-405`, `734-760`)

triggering:
  judge:
    model: claude-haiku-4-5-20251001    # falls back to defaults.model
  should_match:
    - "refactor src/foo.py to remove the global state"
  should_not_match:
    - "explain what this file does"

eva doctor requires the block to be a mapping that contains at least one of should_match / should_not_match; every entry in either list must be a non-empty string.
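
The doctor rule can be sketched as a small validator. This is an illustration only, not the code at bin/eva:386-405: the function name check_triggering and the error strings are hypothetical, and it assumes that "non-empty strings" applies to each list entry.

```python
def check_triggering(block):
    """Return a list of error strings for a `triggering` block (hypothetical helper)."""
    if not isinstance(block, dict):
        return ["triggering must be a mapping"]
    keys = ("should_match", "should_not_match")
    errors = []
    # At least one of the two list keys has to be present.
    if not any(k in block for k in keys):
        errors.append("triggering needs should_match or should_not_match")
    for key in keys:
        if key not in block:
            continue
        items = block[key]
        if not isinstance(items, list):
            errors.append(f"{key} must be a list")
            continue
        # Assumption: each entry must be a non-empty string.
        for i, item in enumerate(items):
            if not isinstance(item, str) or not item.strip():
                errors.append(f"{key}[{i}] must be a non-empty string")
    return errors
```

An empty return value means the block passes this sketch's checks; eva doctor's real messages may differ.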

Judge Prompt Construction (`bin/eva:668-693`)

The judge receives a fixed wrapper plus three sections composed from meta.yml, followed by the test query:

DESCRIPTION: — meta.description (or meta.summary as fallback).
POSITIVE TRIGGERS: — bulleted meta.triggers.
NEGATIVE TRIGGERS (do NOT use for): — bulleted meta.not_for.
USER QUERY: — the test string.

It must reply DECISION=<YES|NO> REASON=<one short sentence>. Anything else returns (None, "judge output unparseable: …") and counts as failure.
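
A minimal sketch of how such a prompt could be assembled and the reply parsed. The section labels and the DECISION regex come from this page; the wrapper wording, the function names, the 80-character truncation, and the shape of the success return value are assumptions, not the contents of bin/eva.

```python
import re

# Hypothetical wording; the real wrapper lives in bin/eva.
WRAPPER = (
    "Decide whether the skill below would auto-load for the user query. "
    "Base the decision on the description and triggers ONLY.\n"
    "Reply exactly: DECISION=<YES|NO> REASON=<one short sentence>"
)

def build_judge_prompt(meta, query):
    """Compose wrapper + DESCRIPTION + trigger sections + USER QUERY."""
    description = meta.get("description") or meta.get("summary") or ""
    triggers = "\n".join(f"- {t}" for t in meta.get("triggers", []))
    not_for = "\n".join(f"- {t}" for t in meta.get("not_for", []))
    return "\n\n".join([
        WRAPPER,
        f"DESCRIPTION:\n{description}",
        f"POSITIVE TRIGGERS:\n{triggers}",
        f"NEGATIVE TRIGGERS (do NOT use for):\n{not_for}",
        f"USER QUERY:\n{query}",
    ])

def parse_decision(stdout):
    """Return (True|False, raw reply), or (None, error) when unparseable."""
    match = re.search(r"DECISION=(YES|NO)", stdout)
    if match is None:
        return None, f"judge output unparseable: {stdout[:80]}"
    return match.group(1) == "YES", stdout.strip()
```

Because the parser uses re.search rather than a full-line match, extra text around the DECISION token would not by itself make the reply unparseable.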

Sequence

sequenceDiagram
  participant E as eva eval
  participant M as meta.yml
  participant J as judge model
  E->>M: read description, triggers, not_for
  loop each should_match query
    E->>J: wrapper + DESCRIPTION + TRIGGERS + USER QUERY
    J-->>E: DECISION=YES|NO REASON=…
    E->>E: PASS iff YES
  end
  loop each should_not_match query
    E->>J: same wrapper, different query
    J-->>E: DECISION=YES|NO REASON=…
    E->>E: PASS iff NO
  end
  E->>E: include in passed/total + .eval.jsonl
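
The two loops in the diagram can be sketched as follows, assuming a callable ask_judge that returns True for YES, False for NO, and None for an unparseable reply. The function name and the result-record fields are hypothetical; the real .eval.jsonl schema is not shown on this page.

```python
def run_triggering(block, ask_judge):
    """Run every should_match / should_not_match query through the judge."""
    results = []
    # should_match expects YES (True); should_not_match expects NO (False).
    for key, expected in (("should_match", True), ("should_not_match", False)):
        for query in block.get(key) or []:
            decision = ask_judge(query)
            results.append({
                "kind": key,
                "query": query,
                "decision": decision,
                # None (unparseable) never equals the expected verdict.
                "passed": decision is expected,
            })
    return results

# Usage with a stand-in judge that says YES to anything mentioning "refactor".
block = {
    "should_match": ["refactor src/foo.py to remove the global state"],
    "should_not_match": ["explain what this file does"],
}
results = run_triggering(block, lambda q: "refactor" in q)
passed = sum(r["passed"] for r in results)
```

In this sketch both queries pass, so passed equals len(results).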

Skip Conditions

The triggering pass is skipped entirely when --case is set (single-case mode), and when the block is absent or both lists are empty (bin/eva:738).
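
Expressed as a predicate, the skip rule might look like the sketch below. The function name, the eval_config dictionary, and the case argument (standing in for the --case flag) are assumptions for illustration.

```python
def should_skip_triggering(eval_config, case=None):
    """True when the triggering pass should not run (hypothetical helper)."""
    if case is not None:  # single-case mode
        return True
    block = eval_config.get("triggering")
    if not block:  # block absent (or empty)
        return True
    # Skip only when both lists are missing or empty.
    return not block.get("should_match") and not block.get("should_not_match")
```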

Key Terms

trigger judge — secondary claude invocation evaluating description-shape alone, not prompt output.
DECISION=YES — the judge predicts the skill auto-loads for this query; required for should_match, forbidden for should_not_match.
default judge model — falls through triggering.judge.model → defaults.model → unset.
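
The model fall-through can be sketched in a few lines. This assumes triggering and defaults are top-level keys of the same parsed config and uses None to represent "unset"; the function name is hypothetical.

```python
def resolve_judge_model(eval_config):
    """triggering.judge.model, else defaults.model, else None (unset)."""
    block = eval_config.get("triggering") or {}
    judge = block.get("judge") or {}
    defaults = eval_config.get("defaults") or {}
    return judge.get("model") or defaults.get("model")
```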

Q&A

Q: Which two list keys does the triggering block accept? A: should_match and should_not_match; doctor requires at least one of them when the block is present (bin/eva:404-405).

Q: What format does the trigger judge return? A: A single line DECISION=<YES|NO> REASON=<sentence> parsed by re.search(r"DECISION=(YES|NO)", stdout) (bin/eva:689).

Q: What gets sent to the judge alongside the user query? A: The meta.description (or summary if missing), bulleted meta.triggers, bulleted meta.not_for, all wrapped in a fixed instruction telling the judge to decide based on description and triggers ONLY (bin/eva:668-685).

Examples

A negative test asserting that the refactor prompt does not grab weather queries:

triggering:
  should_not_match:
    - "what's the weather in San Francisco"
