CRUMB a card from devarno-cloud

.usage.jsonl Append Format

eva beginner 4 min read

ELI5

Every kick run drops one line — a single JSON sticker — into the prompt’s logbook. The sticker says when, with what inputs (hashed), whether it was sent, what the verifier thought, and how big and fast the round-trip was. Stickers never get edited; they just stack.

Technical Deep Dive

Schema (bin/kick:83-98)

Field         Type            Notes
ts            ISO-8601 UTC    iso_now; appended at end of run
case          string | null   The --case name, or null
vars_hash     12-char hex     sha256 of sorted KEY=VALUE lines; literal "none" if no vars
sent          bool            true for --send, false for dry-render
exit_code     int             claude’s exit (PIPESTATUS[1]); 0 for dry-render
verified      bool | null     from verify.sh; null when no verifier
duration_ms   int             ms_now end − start
prompt_words  int             awk NF-tokenised count of rendered prompt
output_words  int             awk NF-count of claude stdout (0 for dry-render)
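To make the row shape concrete, here is a minimal Python sketch of building and appending one row with the nine fields above. kick itself is a bash script, so this is an illustration rather than its code: the names word_count and append_usage_row are hypothetical, and len(text.split()) only approximates summing awk's NF over every line.

```python
import json
from datetime import datetime, timezone


def word_count(text):
    # Whitespace-separated token count; approximates summing awk's NF per line.
    return len(text.split())


def append_usage_row(path, *, case, vars_hash, sent, exit_code, verified,
                     duration_ms, prompt, output):
    # Hypothetical helper: builds one row with the schema's nine fields.
    row = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "case": case,                # --case name, or None (JSON null)
        "vars_hash": vars_hash,      # 12 hex chars, or the literal "none"
        "sent": sent,
        "exit_code": exit_code,
        "verified": verified,        # True, False, or None when no verifier
        "duration_ms": duration_ms,
        "prompt_words": word_count(prompt),
        "output_words": word_count(output),
    }
    # Mode "a" only adds to the end of the file; earlier rows are never rewritten.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, separators=(",", ":")) + "\n")
    return row
```

Under these assumptions, a dry-render would pass sent=False, exit_code=0, verified=None and an empty output string, which yields output_words of 0.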

Class Diagram

classDiagram
class UsageRow {
+ts : iso8601
+case : string?
+vars_hash : hex12
+sent : bool
+exit_code : int
+verified : bool?
+duration_ms : int
+prompt_words : int
+output_words : int
}
class GateConsumer {
+draft_to_tested()
+tested_to_ready()
}
class PerfSummary {
+median_duration_ms
+median_prompt_words
+median_output_words
}
GateConsumer --> UsageRow : reads sent + verified
PerfSummary --> UsageRow : medians over sent rows

Consumers

  • usage_summary (bin/eva:132-161): counts total, sent, verified_true|false|null, last ts, and median of duration_ms/prompt_words/output_words across sent rows. Surfaced by eva show.
  • cmd_log (bin/eva:260-277): tail-prints recent rows.
  • cmd_promote (bin/eva:840-866): the gate counter — splits rows on sent and verified for the lifecycle thresholds in eva-003.

A dry-render (no --send) still appends a row with sent: false, so eva show and eva log reflect template iteration too — but those rows do not count toward any promotion gate.
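The aggregation that usage_summary is described as performing can be approximated in Python as follows. This is a sketch, not the bash code in bin/eva. It assumes the verified counts run over all rows (the card does not say whether they are restricted to sent rows), and it uses statistics.median, which averages the two middle values for an even number of rows; the real script may break ties differently.

```python
import json
from statistics import median


def usage_summary(lines):
    # lines: iterable of JSONL strings, one usage row each; blank lines are skipped.
    rows = [json.loads(line) for line in lines if line.strip()]
    sent = [r for r in rows if r["sent"]]
    summary = {
        "total": len(rows),
        "sent": len(sent),
        # Assumption: verified counts are taken over all rows, not only sent rows.
        "verified_true": sum(r["verified"] is True for r in rows),
        "verified_false": sum(r["verified"] is False for r in rows),
        "verified_null": sum(r["verified"] is None for r in rows),
        "last_ts": rows[-1]["ts"] if rows else None,
    }
    # Medians are taken across sent rows only, so dry-renders do not skew them.
    for field in ("duration_ms", "prompt_words", "output_words"):
        values = [r[field] for r in sent]
        summary["median_" + field] = median(values) if values else None
    return summary
```

Filtering on sent before taking medians is what keeps dry-render rows, with their output_words of 0, out of the performance figures.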

Key Terms

  • vars_hash: first 12 hex chars of sha256 over the sorted KEY=VALUE lines; a stable signature for spotting “same inputs, different run”.
  • PIPESTATUS[1]: bash's PIPESTATUS array holds the exit status of each command in the most recent pipeline, indexed from 0; index 1 is used here to capture claude’s exit code through the tee pipe (bin/kick:196-197).
  • sent row: a row with sent: true, i.e. a real model call; the set of rows every aggregator works over when it cares about real model calls.

Q&A

Q: What is vars_hash and how is it computed? A: First 12 hex chars of sha256sum over the lines from sorting the VARS array (bin/kick:71-78). Empty VARS produces the literal string "none".
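A Python sketch of the same idea, for readers who want to reproduce a hash by hand. The exact bytes fed to sha256sum in bin/kick are not shown on this card, so two details here are assumptions: each KEY=VALUE pair is written as its own newline-terminated line, and sorting is by code point (bash's sort is locale-dependent). The function name mirrors the field and is not taken from the script.

```python
import hashlib


def vars_hash(pairs):
    # pairs: list of "KEY=VALUE" strings, e.g. ["NAME=ada", "LANG=en"].
    if not pairs:
        return "none"  # literal string for an empty VARS array, per the card
    # Assumption: one pair per line, each terminated by "\n", sorted by code point.
    payload = "".join(line + "\n" for line in sorted(pairs))
    # Keep only the first 12 hex characters of the digest.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Because the pairs are sorted before hashing, passing the same variables in a different order gives the same signature, while changing any value gives a different one.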

Q: Which fields feed eva show’s perf medians? A: duration_ms, prompt_words, output_words — medians taken across rows where sent: true (bin/eva:140-160).

Q: Why does a dry run still get a row in .usage.jsonl? A: kick always calls append_usage at the bottom of the non-send branch (bin/kick:213) so authors can see iteration history. The row carries sent: false, exit_code: 0, verified: null, output_words: 0, which keeps it out of every gate counter.

Examples

Sample row:

{"ts":"2026-05-05T09:31:00Z","case":"happy-path","vars_hash":"a3f0…b1","sent":true,"exit_code":0,"verified":true,"duration_ms":18342,"prompt_words":612,"output_words":1483}
