Engine Metrics Catalogue
nestr intermediate 6 min read
ELI5
The Engine has two instrument clusters on its dashboard: nestr_* is the speedometer (per-request and per-pellet timings), and orchestrator_* is the fleet log (per-service operation counts). Both feed the same Prometheus scrape, but they answer different questions.
Technical Deep Dive
Two namespaces, two files:
- engine/internal/adapters/metrics.go → nestr_*: HTTP and pellet primitives.
- engine/internal/pkg/metrics.go → orchestrator_*: cross-service operation telemetry, richer labels.
Metric Inventory
classDiagram
    class nestr_metrics {
        Counter operations_total
        Summary request_duration_seconds quantiles
        CounterVec requests_total by_status
        Counter cache_hits_total
        Counter cache_misses_total
        Histogram pellet_compress_duration_seconds
        Histogram pellet_extract_duration_seconds
        Counter pellet_compress_total
        Counter pellet_extract_total
        Gauge cache_hit_ratio
        Gauge cache_size_bytes
    }
    class orchestrator_metrics {
        CounterVec operations_total by_service_environment_version_operation_status
        HistogramVec operation_duration_seconds labelled
        GaugeVec operation_status by_operation_id
        Gauge active_operations
        CounterVec repo_operations_total by_repo_operation_status
        HistogramVec assembly_duration_seconds buckets
        CounterVec sync_operations_total by_sync_type_target_status
        CounterVec errors_total by_error_type_component
        CounterVec cache_hits_total by_cache_type_result
        Gauge active_workspaces
    }
Recording Hooks
| Code path | Metric written |
|---|---|
| loggingMiddleware (rest.go) | nestr_request_duration_seconds, nestr_requests_total{status} |
| handleCompressPellet | nestr_pellet_compress_duration_seconds, nestr_pellet_compress_total |
| handleExtractPellet | nestr_pellet_extract_duration_seconds, nestr_pellet_extract_total |
| pellet store hit/miss | nestr_cache_hits_total / nestr_cache_misses_total → ratio recomputed into nestr_cache_hit_ratio gauge |
| assembly workflow | orchestrator_assembly_duration_seconds, orchestrator_active_operations |
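To make the first row concrete, here is a minimal sketch of a timing-and-status middleware. It is not the Engine's loggingMiddleware: the recorder type, the serveOnce helper, and the /pellets/demo path are hypothetical, and an in-memory map and slice stand in for the Prometheus collectors so the example runs with the standard library alone.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// recorder is a hypothetical in-memory stand-in for the real collectors.
type recorder struct {
	mu        sync.Mutex
	requests  map[int]int     // plays the role of nestr_requests_total{status}
	durations []time.Duration // plays the role of nestr_request_duration_seconds
}

// statusWriter remembers the status code the wrapped handler wrote.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// middleware times the wrapped handler and counts the response by status.
func (r *recorder) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()
		// Default to 200: handlers that never call WriteHeader imply it.
		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(sw, req)

		r.mu.Lock()
		defer r.mu.Unlock()
		r.requests[sw.status]++
		r.durations = append(r.durations, time.Since(start))
	})
}

// serveOnce pushes one fake request through the middleware (demo helper).
func serveOnce(r *recorder, status int) {
	h := r.middleware(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/pellets/demo", nil))
}

func main() {
	rec := &recorder{requests: map[int]int{}}
	serveOnce(rec, http.StatusOK)
	serveOnce(rec, http.StatusNotFound)
	fmt.Println("requests by status:", rec.requests)
	fmt.Println("durations recorded:", len(rec.durations))
}
```

Both values are written after the handler returns, so a single code path yields one duration observation and one status-labelled increment per request.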
Histogram Buckets
- Pellet compress/extract: exponential, base 0.1, factor 2, count 10 → covers 0.1 s to ~50 s.
- Assembly duration: explicit [1, 5, 10, 30, 60, 120, 300, 600] seconds, sized for full-workspace operations.
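The bucket arithmetic can be checked with a short sketch. The exponentialBuckets helper below is a hypothetical, dependency-free function that follows the same start/factor/count convention as client_golang's prometheus.ExponentialBuckets. Whether the Engine calls that helper or lists the bounds by hand is an assumption.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds: start, start*factor,
// start*factor^2, and so on. Hypothetical helper for illustration.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Pellet compress/extract: start 0.1 s, factor 2, 10 buckets.
	pellet := exponentialBuckets(0.1, 2, 10)
	fmt.Printf("pellet buckets (s): %.1f\n", pellet)

	// Assembly duration: explicit bounds as listed in the catalogue.
	assembly := []float64{1, 5, 10, 30, 60, 120, 300, 600}
	fmt.Println("assembly buckets (s):", assembly)
}
```

The tenth pellet bound is 0.1 × 2⁹ = 51.2 s, which is where the "~50 s" figure comes from. Observations above it fall only into the implicit +Inf bucket.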
request_duration_seconds is a Summary
nestr_request_duration_seconds is a Prometheus Summary with quantiles {0.5, 0.95, 0.99}. Summary quantiles are computed client-side per instance and cannot be meaningfully aggregated across instances in PromQL (only the _sum and _count series can), so if the Engine is ever horizontally scaled, this metric should migrate to a Histogram. The orchestrator namespace already uses Histograms for that reason.
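A small numeric sketch shows why this matters. The data is invented for illustration: two hypothetical replicas, one serving 100 mostly fast requests and one serving 10 mostly slow ones. The quantile function is a simplified approximation of the linear interpolation that histogram_quantile() performs, not a reimplementation of it.

```go
package main

import "fmt"

// histogram holds cumulative counts per upper bound, like the _bucket
// series Prometheus exposes (the +Inf bucket is omitted for brevity).
type histogram struct {
	upperBounds []float64 // sorted ascending, in seconds
	cumulative  []uint64  // cumulative[i] = observations <= upperBounds[i]
}

// merge sums bucket counts from two instances that share the same bounds.
func merge(a, b histogram) histogram {
	out := histogram{upperBounds: a.upperBounds, cumulative: make([]uint64, len(a.cumulative))}
	for i := range a.cumulative {
		out.cumulative[i] = a.cumulative[i] + b.cumulative[i]
	}
	return out
}

// quantile estimates q by linear interpolation inside the bucket that
// contains the target rank.
func quantile(q float64, h histogram) float64 {
	total := float64(h.cumulative[len(h.cumulative)-1])
	if total == 0 {
		return 0
	}
	rank := q * total
	for i, c := range h.cumulative {
		if c > 0 && float64(c) >= rank {
			lower, prev := 0.0, 0.0
			if i > 0 {
				lower, prev = h.upperBounds[i-1], float64(h.cumulative[i-1])
			}
			return lower + (h.upperBounds[i]-lower)*(rank-prev)/(float64(c)-prev)
		}
	}
	return h.upperBounds[len(h.upperBounds)-1]
}

func main() {
	bounds := []float64{0.1, 0.5, 1, 5}
	fast := histogram{bounds, []uint64{90, 99, 100, 100}} // 100 mostly quick requests
	slow := histogram{bounds, []uint64{0, 2, 5, 10}}      // 10 mostly slow requests

	// What averaging two Summary quantiles would give.
	avgOfP99 := (quantile(0.99, fast) + quantile(0.99, slow)) / 2
	// What a Histogram allows: sum the buckets first, then take the quantile.
	fleetP99 := quantile(0.99, merge(fast, slow))

	fmt.Printf("average of per-instance p99: %.2fs\n", avgOfP99)
	fmt.Printf("p99 from merged buckets:     %.2fs\n", fleetP99)
}
```

With these numbers the average of the per-instance p99 values is about 2.71 s, while the p99 over the merged buckets is about 4.12 s. Only bucket counts carry enough information to combine replicas correctly.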
Key Terms
- Summary → client-side quantile estimator; cheap per instance, not aggregable across instances.
- Histogram → bucketed counts; aggregable via histogram_quantile() across replicas.
- Label cardinality → orchestrator_metrics uses high-cardinality labels (operation_id); reserve those for gauges that drop quickly, not counters that accumulate forever.
Q&A
Q: Why does orchestrator_operation_status carry operation_id while the counters don’t?
A: It is a gauge that toggles per operation lifetime; cardinality is bounded by concurrency. Counters with operation_id would explode the TSDB.
Q: How is nestr_cache_hit_ratio kept in sync with the counters?
A: The store recomputes it on every hit/miss path and Set()s the gauge. Counters and gauge can briefly disagree under contention but converge after each access.
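The recompute-and-Set pattern described in this answer might look like the sketch below. The cacheStats type and its method names are hypothetical, and atomic fields stand in for the two Prometheus counters and the gauge. It assumes Go 1.19 or later for atomic.Uint64.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// cacheStats is a hypothetical stand-in for the pellet store's metrics.
type cacheStats struct {
	hits      atomic.Uint64 // plays the role of nestr_cache_hits_total
	misses    atomic.Uint64 // plays the role of nestr_cache_misses_total
	ratioBits atomic.Uint64 // float64 bits of the gauge value
}

// record bumps the matching counter, then republishes the ratio.
// The increment and the store are separate atomic steps, so under
// contention the gauge can briefly lag the counters until the next access.
func (s *cacheStats) record(hit bool) {
	if hit {
		s.hits.Add(1)
	} else {
		s.misses.Add(1)
	}
	h, m := s.hits.Load(), s.misses.Load()
	// h+m is at least 1 here because one counter was just incremented.
	s.ratioBits.Store(math.Float64bits(float64(h) / float64(h+m)))
}

// ratio returns the last published hit ratio.
func (s *cacheStats) ratio() float64 {
	return math.Float64frombits(s.ratioBits.Load())
}

func main() {
	var s cacheStats
	for _, hit := range []bool{true, true, false, true} {
		s.record(hit)
	}
	fmt.Printf("hits=%d misses=%d ratio=%.2f\n", s.hits.Load(), s.misses.Load(), s.ratio())
}
```

An alternative that avoids the gauge entirely is to derive the ratio in PromQL from the two counters. The gauge trades that flexibility for a value that is readable directly on the scrape.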
Q: What’s the practical difference for an operator if a metric is in nestr_ vs orchestrator_?
A: nestr_* answers “how is this single Engine doing”; orchestrator_* answers “how are operations across services and environments doing”. Different dashboards, different alerts.
Examples
99th-percentile (p99) compress time over a 5-minute window: histogram_quantile(0.99, sum(rate(nestr_pellet_compress_duration_seconds_bucket[5m])) by (le)). Bridge federates the same series outward (see nestr-009).
neighbors on the map
- FNP Observability & Prometheus Metrics: monitoring FNP systems
- Deployment Topology & Proxy Conflict Resolution: setting up a new environment (kitten/cat/lion)
- Run Outcome Classification: interpreting a History row's status pill