Quality Scoring & Regression Detection

grace intermediate 5 min read

ELI5

Every run gets a 0-to-1 grade. Three teachers vote with different weights: did the output match the contract, was anything missing, and was the prompt-to-output ratio reasonable. Drop more than 0.05 between versions and an automatic referee (Veritas) is summoned.

Technical Deep Dive

Source: automation/trace-protocol.md, @stratt/cli’s lib/regression.ts (detectRegressions), lib/veritas.ts (VeritasGate).

Score Formula

quality_score = (conformance × 0.40) + (completeness × 0.35) + (efficiency × 0.25)

Factor	Weight	Measurement
Contract conformance	0.40	All outputs match declared types and constraints (boolean, treated as 1.0/0.0)
Completeness	0.35	Fraction of required outputs present and non-empty
Token efficiency	0.25	Useful-output-tokens / total-tokens ratio

Thresholds

Score	Band	Action
≥ 0.80	Good	None
0.60–0.79	Review	Flag for manual inspection
< 0.60	Poor	Trigger refinement cycle
Δ > 0.05 vs prior version	Regression	Trigger Veritas gate

Detection & Veritas Pipeline

flowchart LR
    A[New chain version published] --> B[3 live runs + N dry-runs]
    B --> C[per-run quality_score]
    C --> D[mean current vs mean prior]
    D --> E{delta > 0.05?}
    E -->|no| F[publish stays live]
    E -->|yes| G[Veritas gate triggered]
    G --> H{authority decision}
    H -->|APPROVED| I[publish retained]
    H -->|REJECTED| J[rollback to prior version]
    H -->|TIMEOUT| K[chain halted; no auto-approve]
    style E fill:#fef3c7
    style G fill:#fecaca

Pluggability

The default scorer is heuristic; the interface is pluggable. A future replacement can call into a learned model without changing the trace shape (quality_indicators retains the three factor inputs alongside the composite score so retro-rescoring works).

Weekly Evaluation Cycle

Per automation/trace-protocol.md, Monday memory maintenance:

Read all traces from the past week.
Compute per-unit and per-chain mean scores.
Compare against previous week.
Flag underperformers (< 0.60 or Δ > 0.05).
Refine: prompt body, agent assignment, step ordering, gate threshold.
Log summary to memory/YYYY-MM-DD.md.
Re-run stratt validate + stratt fingerprint --verify after any unit change.

CLI Surface

stratt regression — invokes detectRegressions.
stratt gate veritas --dry-run — runs the live 3-run + dry-run evaluation without committing the gate decision.
stratt export-dspy --min-score 0.7 — exports JSONL filtered to passing runs for MIPROv2 ingestion.

Key Terms

Veritas gate → SPEC-05 gate fired when quality regression exceeds Δ 0.05; distinct from SPEC-04 human-approval gates.
Refinement cycle → Author-driven adjustment to prompt body, agent assignment, or gate thresholds when a unit scores < 0.60.
MIPROv2 → DSPy optimiser; future automation target for prompt parameter tuning fed by JSONL exports.

Q&A

Q: What three factors compose the default quality score? A: Contract conformance (0.40), completeness (0.35), token efficiency (0.25). Weights live in the heuristic scorer and are pluggable.

Q: What delta triggers the Veritas gate? A: Mean quality score delta greater than 0.05 between adjacent versions, computed across the live 3-run sample.

Q: Is regression detection automated end-to-end today? A: Detection is implemented (regression.ts); event-driven Veritas auto-trigger automation is open work per GRACE-PROTOCOL-OVERVIEW.md “What Remains”.

Examples

dev/chain/sol-1-boot historically scores 0.84 mean. After bumping to 1.1.0, three live runs score 0.78, 0.75, 0.79 (mean 0.77). Δ = 0.07 > 0.05. detectRegressions returns a hit; Veritas gate is enqueued; rollback to 1.0.x is the candidate action if LEWIS-06 rejects.

neighbors on the map

Gate Checkpoint Protocol designing a gated chain step
Execution Trace Format wiring a custom executor to emit SPEC-05-compatible traces