Quality Scoring & Regression Detection
grace intermediate 5 min read
ELI5
Every run gets a 0-to-1 grade. Three teachers vote with different weights: did the output match the contract, was anything missing, and was the prompt-to-output ratio reasonable. Drop more than 0.05 between versions and an automatic referee (Veritas) is summoned.
Technical Deep Dive
Source: automation/trace-protocol.md, @stratt/cli’s lib/regression.ts (detectRegressions), lib/veritas.ts (VeritasGate).
Score Formula
quality_score = (conformance × 0.40) + (completeness × 0.35) + (efficiency × 0.25)| Factor | Weight | Measurement |
|---|---|---|
| Contract conformance | 0.40 | All outputs match declared types and constraints (boolean, treated as 1.0/0.0) |
| Completeness | 0.35 | Fraction of required outputs present and non-empty |
| Token efficiency | 0.25 | Useful-output-tokens / total-tokens ratio |
Thresholds
| Score | Band | Action |
|---|---|---|
| ≥ 0.80 | Good | None |
| 0.60–0.79 | Review | Flag for manual inspection |
| < 0.60 | Poor | Trigger refinement cycle |
| Δ > 0.05 vs prior version | Regression | Trigger Veritas gate |
Detection & Veritas Pipeline
flowchart LR A[New chain version published] --> B[3 live runs + N dry-runs] B --> C[per-run quality_score] C --> D[mean current vs mean prior] D --> E{delta > 0.05?} E -->|no| F[publish stays live] E -->|yes| G[Veritas gate triggered] G --> H{authority decision} H -->|APPROVED| I[publish retained] H -->|REJECTED| J[rollback to prior version] H -->|TIMEOUT| K[chain halted; no auto-approve] style E fill:#fef3c7 style G fill:#fecacaPluggability
The default scorer is heuristic; the interface is pluggable. A future replacement can call into a learned model without changing the trace shape (quality_indicators retains the three factor inputs alongside the composite score so retro-rescoring works).
Weekly Evaluation Cycle
Per automation/trace-protocol.md, Monday memory maintenance:
- Read all traces from the past week.
- Compute per-unit and per-chain mean scores.
- Compare against previous week.
- Flag underperformers (< 0.60 or Δ > 0.05).
- Refine: prompt body, agent assignment, step ordering, gate threshold.
- Log summary to
memory/YYYY-MM-DD.md. - Re-run
stratt validate+stratt fingerprint --verifyafter any unit change.
CLI Surface
stratt regression— invokesdetectRegressions.stratt gate veritas --dry-run— runs the live 3-run + dry-run evaluation without committing the gate decision.stratt export-dspy --min-score 0.7— exports JSONL filtered to passing runs for MIPROv2 ingestion.
Key Terms
- Veritas gate → SPEC-05 gate fired when quality regression exceeds Δ 0.05; distinct from SPEC-04 human-approval gates.
- Refinement cycle → Author-driven adjustment to prompt body, agent assignment, or gate thresholds when a unit scores < 0.60.
- MIPROv2 → DSPy optimiser; future automation target for prompt parameter tuning fed by JSONL exports.
Q&A
Q: What three factors compose the default quality score? A: Contract conformance (0.40), completeness (0.35), token efficiency (0.25). Weights live in the heuristic scorer and are pluggable.
Q: What delta triggers the Veritas gate? A: Mean quality score delta greater than 0.05 between adjacent versions, computed across the live 3-run sample.
Q: Is regression detection automated end-to-end today?
A: Detection is implemented (regression.ts); event-driven Veritas auto-trigger automation is open work per GRACE-PROTOCOL-OVERVIEW.md “What Remains”.
Examples
dev/chain/sol-1-boot historically scores 0.84 mean. After bumping to 1.1.0, three live runs score 0.78, 0.75, 0.79 (mean 0.77). Δ = 0.07 > 0.05. detectRegressions returns a hit; Veritas gate is enqueued; rollback to 1.0.x is the candidate action if LEWIS-06 rejects.
neighbors on the map
- Gate Checkpoint Protocol designing a gated chain step
- Execution Trace Format wiring a custom executor to emit SPEC-05-compatible traces