Engine Metrics Catalogue

nestr intermediate 6 min read

ELI5

The Engine has two instrument panels bolted to its dashboard: nestr_* is the speedometer (per-request and per-pellet timings), and orchestrator_* is the fleet log (per-service operation counts). Both feed the same Prometheus scrape, but they answer different questions.

Technical Deep Dive

Two namespaces, two files:

engine/internal/adapters/metrics.go → nestr_*: HTTP and pellet primitives.
engine/internal/pkg/metrics.go → orchestrator_*: cross-service operation telemetry, richer labels.

Metric Inventory

classDiagram
    class nestr_metrics {
        Counter operations_total
        Summary request_duration_seconds quantiles
        CounterVec requests_total by_status
        Counter cache_hits_total
        Counter cache_misses_total
        Histogram pellet_compress_duration_seconds
        Histogram pellet_extract_duration_seconds
        Counter pellet_compress_total
        Counter pellet_extract_total
        Gauge cache_hit_ratio
        Gauge cache_size_bytes
    }
    class orchestrator_metrics {
        CounterVec operations_total by_service_environment_version_operation_status
        HistogramVec operation_duration_seconds labelled
        GaugeVec operation_status by_operation_id
        Gauge active_operations
        CounterVec repo_operations_total by_repo_operation_status
        HistogramVec assembly_duration_seconds buckets
        CounterVec sync_operations_total by_sync_type_target_status
        CounterVec errors_total by_error_type_component
        CounterVec cache_hits_total by_cache_type_result
        Gauge active_workspaces
    }

Recording Hooks

Code path	Metric written
`loggingMiddleware` (`rest.go`)	`nestr_request_duration_seconds`, `nestr_requests_total{status}`
`handleCompressPellet`	`nestr_pellet_compress_duration_seconds`, `nestr_pellet_compress_total`
`handleExtractPellet`	`nestr_pellet_extract_duration_seconds`, `nestr_pellet_extract_total`
pellet store hit/miss	`nestr_cache_hits_total` / `nestr_cache_misses_total` → ratio recomputed into `nestr_cache_hit_ratio` gauge
assembly workflow	`orchestrator_assembly_duration_seconds`, `orchestrator_active_operations`
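The hit/miss row above can be sketched without any Prometheus dependency. This is a minimal illustration only: `cacheStats`, `recordHit`, `recordMiss`, and `hitRatio` are hypothetical names, and the real store presumably writes to Prometheus counter and gauge objects rather than raw atomics.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// cacheStats is a hypothetical stand-in for the two counters and the
// derived gauge; it is not the Engine's actual type.
type cacheStats struct {
	hits   atomic.Uint64
	misses atomic.Uint64
	ratio  atomic.Uint64 // float64 bits of the last computed hit ratio
}

// recordHit and recordMiss bump their counter, then refresh the gauge,
// mirroring "recompute on every hit/miss path".
func (s *cacheStats) recordHit()  { s.hits.Add(1); s.recompute() }
func (s *cacheStats) recordMiss() { s.misses.Add(1); s.recompute() }

// recompute reads both counters and stores hits/(hits+misses). The two
// loads are not one atomic snapshot, so concurrent writers can leave the
// gauge briefly stale; the next access corrects it.
func (s *cacheStats) recompute() {
	h, m := s.hits.Load(), s.misses.Load()
	if h+m == 0 {
		return // nothing observed yet; leave the gauge at its zero value
	}
	s.ratio.Store(math.Float64bits(float64(h) / float64(h+m)))
}

// hitRatio returns the current gauge value.
func (s *cacheStats) hitRatio() float64 {
	return math.Float64frombits(s.ratio.Load())
}

func main() {
	var stats cacheStats
	for i := 0; i < 3; i++ {
		stats.recordHit()
	}
	stats.recordMiss()
	fmt.Printf("hits=%d misses=%d ratio=%.2f\n",
		stats.hits.Load(), stats.misses.Load(), stats.hitRatio())
}
```

Storing the ratio as a gauge trades a little write-path work for a cheaper dashboard query. The same value can also be derived at query time from the two counters.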

Histogram Buckets

Pellet compress/extract: exponential, base 0.1, factor 2, count 10 → bucket upper bounds from 0.1 s to 51.2 s (0.1 × 2⁹).
Assembly duration: explicit [1, 5, 10, 30, 60, 120, 300, 600] seconds — sized for full-workspace operations.
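To make the pellet bucket layout concrete, the sketch below reproduces the arithmetic of an exponential layout in plain Go. The helper `exponentialBuckets` is written here for illustration and is an assumption; the Engine itself presumably relies on its Prometheus client library for this.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds: the first is start, and
// each following bound is the previous one multiplied by factor.
// Illustrative helper, not the Engine's code.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Parameters taken from the pellet compress/extract description above.
	pellet := exponentialBuckets(0.1, 2, 10)
	fmt.Printf("pellet bucket bounds (s): %.1f\n", pellet)
	fmt.Printf("largest finite bound: %.1f s\n", pellet[len(pellet)-1])
}
```

Observations above the largest finite bound still land in the implicit +Inf bucket, so a pellet operation slower than the top bound is counted but not resolved any further.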

request_duration_seconds is a Summary

nestr_request_duration_seconds is a Prometheus Summary with quantiles {0.5, 0.95, 0.99}. Summary quantiles are computed client-side on each instance and cannot be meaningfully aggregated across instances in PromQL: averaging per-instance p99 values does not yield a fleet-wide p99. If the Engine is ever horizontally scaled, this metric should migrate to a Histogram. The orchestrator namespace already uses Histograms for that reason.
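The aggregation argument can be checked with a small stdlib-only sketch. It sums cumulative bucket counts from two made-up replicas and estimates a quantile by linear interpolation inside the matching bucket. This is a simplified version of the idea behind histogram_quantile(), not Prometheus's exact implementation, and the `bucket` type, function names, and sample counts are all illustrative assumptions.

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket: the number of observations
// less than or equal to upperBound.
type bucket struct {
	upperBound float64
	count      float64
}

// mergeReplicas sums cumulative counts bucket by bucket. This is only
// valid because every replica shares the same bucket boundaries.
func mergeReplicas(replicas ...[]bucket) []bucket {
	merged := make([]bucket, len(replicas[0]))
	copy(merged, replicas[0])
	for _, r := range replicas[1:] {
		for i := range merged {
			merged[i].count += r[i].count
		}
	}
	return merged
}

// quantile estimates the q-quantile by finding the first bucket whose
// cumulative count reaches the target rank, then interpolating linearly
// between that bucket's lower and upper bounds.
func quantile(q float64, buckets []bucket) float64 {
	rank := q * buckets[len(buckets)-1].count
	lower, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			if b.count == prevCount {
				return b.upperBound // empty bucket: nothing to interpolate
			}
			return lower + (b.upperBound-lower)*(rank-prevCount)/(b.count-prevCount)
		}
		lower, prevCount = b.upperBound, b.count
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	// Made-up data: one fast replica, one slow replica, 100 requests each.
	fast := []bucket{{0.1, 90}, {0.2, 100}, {0.4, 100}}
	slow := []bucket{{0.1, 0}, {0.2, 10}, {0.4, 100}}

	fleet := quantile(0.5, mergeReplicas(fast, slow))
	averaged := (quantile(0.5, fast) + quantile(0.5, slow)) / 2

	fmt.Printf("p50 from merged buckets:  %.3f s\n", fleet)
	fmt.Printf("mean of per-replica p50s: %.3f s\n", averaged)
}
```

The two printed numbers differ. A Summary exports only the per-replica quantiles, so the merged figure cannot be recovered from it, whereas Histogram bucket counts can be summed before the quantile is estimated.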

Key Terms

Summary → client-side quantile estimator; cheap per instance, not aggregable across instances.
Histogram → bucketed counts; aggregable via histogram_quantile() across replicas.
Label cardinality → orchestrator_metrics uses high-cardinality labels (operation_id); reserve those for gauges that drop quickly, not counters that accumulate forever.

Q&A

Q: Why does orchestrator_operation_status carry operation_id while the counters don’t? A: It is a gauge that toggles per operation lifetime; cardinality is bounded by concurrency. Counters with operation_id would explode the TSDB.

Q: How is nestr_cache_hit_ratio kept in sync with the counters? A: The store recomputes it on every hit/miss path and Set()s the gauge. Counters and gauge can briefly disagree under contention but converge after each access.

Q: What’s the practical difference for an operator if a metric is in nestr_ vs orchestrator_? A: nestr_* answers “how is this single Engine doing”; orchestrator_* answers “how are operations across services and environments doing”. Different dashboards, different alerts.

Examples

99th-percentile (p99) compress time over a 5-minute window: histogram_quantile(0.99, sum(rate(nestr_pellet_compress_duration_seconds_bucket[5m])) by (le)). Bridge federates the same series outward (see nestr-009).

neighbors on the map

FNP Observability & Prometheus Metrics monitoring FNP systems
Deployment Topology & Proxy Conflict Resolution setting up a new environment (kitten/cat/lion)
Run Outcome Classification interpreting a History row's status pill