CRUMB a card from devarno-cloud

Quality Scoring & Regression Detection

grace intermediate 5 min read

ELI5

Every run gets a 0-to-1 grade. Three teachers vote with different weights: did the output match the contract, was anything missing, and was the prompt-to-output ratio reasonable. Drop more than 0.05 between versions and an automatic referee (Veritas) is summoned.

Technical Deep Dive

Source: automation/trace-protocol.md, @stratt/cli’s lib/regression.ts (detectRegressions), lib/veritas.ts (VeritasGate).

Score Formula

quality_score = (conformance × 0.40) + (completeness × 0.35) + (efficiency × 0.25)
FactorWeightMeasurement
Contract conformance0.40All outputs match declared types and constraints (boolean, treated as 1.0/0.0)
Completeness0.35Fraction of required outputs present and non-empty
Token efficiency0.25Useful-output-tokens / total-tokens ratio

Thresholds

ScoreBandAction
≥ 0.80GoodNone
0.60–0.79ReviewFlag for manual inspection
< 0.60PoorTrigger refinement cycle
Δ > 0.05 vs prior versionRegressionTrigger Veritas gate

Detection & Veritas Pipeline

flowchart LR
A[New chain version published] --> B[3 live runs + N dry-runs]
B --> C[per-run quality_score]
C --> D[mean current vs mean prior]
D --> E{delta > 0.05?}
E -->|no| F[publish stays live]
E -->|yes| G[Veritas gate triggered]
G --> H{authority decision}
H -->|APPROVED| I[publish retained]
H -->|REJECTED| J[rollback to prior version]
H -->|TIMEOUT| K[chain halted; no auto-approve]
style E fill:#fef3c7
style G fill:#fecaca

Pluggability

The default scorer is heuristic; the interface is pluggable. A future replacement can call into a learned model without changing the trace shape (quality_indicators retains the three factor inputs alongside the composite score so retro-rescoring works).

Weekly Evaluation Cycle

Per automation/trace-protocol.md, Monday memory maintenance:

  1. Read all traces from the past week.
  2. Compute per-unit and per-chain mean scores.
  3. Compare against previous week.
  4. Flag underperformers (< 0.60 or Δ > 0.05).
  5. Refine: prompt body, agent assignment, step ordering, gate threshold.
  6. Log summary to memory/YYYY-MM-DD.md.
  7. Re-run stratt validate + stratt fingerprint --verify after any unit change.

CLI Surface

  • stratt regression — invokes detectRegressions.
  • stratt gate veritas --dry-run — runs the live 3-run + dry-run evaluation without committing the gate decision.
  • stratt export-dspy --min-score 0.7 — exports JSONL filtered to passing runs for MIPROv2 ingestion.

Key Terms

  • Veritas gate → SPEC-05 gate fired when quality regression exceeds Δ 0.05; distinct from SPEC-04 human-approval gates.
  • Refinement cycle → Author-driven adjustment to prompt body, agent assignment, or gate thresholds when a unit scores < 0.60.
  • MIPROv2 → DSPy optimiser; future automation target for prompt parameter tuning fed by JSONL exports.

Q&A

Q: What three factors compose the default quality score? A: Contract conformance (0.40), completeness (0.35), token efficiency (0.25). Weights live in the heuristic scorer and are pluggable.

Q: What delta triggers the Veritas gate? A: Mean quality score delta greater than 0.05 between adjacent versions, computed across the live 3-run sample.

Q: Is regression detection automated end-to-end today? A: Detection is implemented (regression.ts); event-driven Veritas auto-trigger automation is open work per GRACE-PROTOCOL-OVERVIEW.md “What Remains”.

Examples

dev/chain/sol-1-boot historically scores 0.84 mean. After bumping to 1.1.0, three live runs score 0.78, 0.75, 0.79 (mean 0.77). Δ = 0.07 > 0.05. detectRegressions returns a hit; Veritas gate is enqueued; rollback to 1.0.x is the candidate action if LEWIS-06 rejects.

neighbors on the map