CRUMB a card from devarno-cloud

Engine Metrics Catalogue

nestr intermediate 6 min read

ELI5

The Engine has two instruments bolted to its dashboard: nestr_* is the speedometer (per-request and per-pellet timings), and orchestrator_* is the fleet log (per-service operation counts). Both feed the same Prometheus scrape, but they answer different questions.

Technical Deep Dive

Two namespaces, two files:

  • engine/internal/adapters/metrics.go → nestr_*: HTTP and pellet primitives.
  • engine/internal/pkg/metrics.go → orchestrator_*: cross-service operation telemetry, richer labels.

Metric Inventory

classDiagram
class nestr_metrics {
Counter operations_total
Summary request_duration_seconds quantiles
CounterVec requests_total by_status
Counter cache_hits_total
Counter cache_misses_total
Histogram pellet_compress_duration_seconds
Histogram pellet_extract_duration_seconds
Counter pellet_compress_total
Counter pellet_extract_total
Gauge cache_hit_ratio
Gauge cache_size_bytes
}
class orchestrator_metrics {
CounterVec operations_total by_service_environment_version_operation_status
HistogramVec operation_duration_seconds labelled
GaugeVec operation_status by_operation_id
Gauge active_operations
CounterVec repo_operations_total by_repo_operation_status
HistogramVec assembly_duration_seconds buckets
CounterVec sync_operations_total by_sync_type_target_status
CounterVec errors_total by_error_type_component
CounterVec cache_hits_total by_cache_type_result
Gauge active_workspaces
}

Recording Hooks

Code path                   | Metric written
loggingMiddleware (rest.go) | nestr_request_duration_seconds, nestr_requests_total{status}
handleCompressPellet        | nestr_pellet_compress_duration_seconds, nestr_pellet_compress_total
handleExtractPellet         | nestr_pellet_extract_duration_seconds, nestr_pellet_extract_total
pellet store hit/miss       | nestr_cache_hits_total / nestr_cache_misses_total → ratio recomputed into nestr_cache_hit_ratio gauge
assembly workflow           | orchestrator_assembly_duration_seconds, orchestrator_active_operations
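The pellet store row above (two counters feeding a recomputed ratio gauge) can be sketched with the standard library alone. This is a minimal illustration, not the Engine's code: the cacheStats type and its methods are hypothetical stand-ins for the real Prometheus counters and gauge.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// cacheStats is a hypothetical stand-in for the pellet store's metrics.
type cacheStats struct {
	hits      uint64 // plays the role of nestr_cache_hits_total
	misses    uint64 // plays the role of nestr_cache_misses_total
	ratioBits uint64 // float64 bits, plays the role of nestr_cache_hit_ratio
}

// record bumps one counter, then recomputes the gauge from both counters.
func (s *cacheStats) record(hit bool) {
	if hit {
		atomic.AddUint64(&s.hits, 1)
	} else {
		atomic.AddUint64(&s.misses, 1)
	}
	h := atomic.LoadUint64(&s.hits)
	m := atomic.LoadUint64(&s.misses)
	// h+m is at least 1 here, so the division is safe.
	ratio := float64(h) / float64(h+m)
	atomic.StoreUint64(&s.ratioBits, math.Float64bits(ratio))
}

// ratio returns the most recently stored gauge value.
func (s *cacheStats) ratio() float64 {
	return math.Float64frombits(atomic.LoadUint64(&s.ratioBits))
}

func main() {
	var s cacheStats
	for _, hit := range []bool{true, true, false, true} {
		s.record(hit)
	}
	fmt.Printf("hit ratio: %.2f\n", s.ratio())
}
```

The counter update and the gauge store are separate atomic steps, so two concurrent callers can briefly leave a stale ratio behind; the next access overwrites it.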

Histogram Buckets

  • Pellet compress/extract: exponential, start 0.1, factor 2, count 10 → upper bounds from 0.1 s to 51.2 s; slower observations land in the implicit +Inf bucket.
  • Assembly duration: explicit [1, 5, 10, 30, 60, 120, 300, 600] seconds — sized for full-workspace operations.
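To make the exponential layout above concrete, here is a small sketch that derives the ten upper bounds from the values 0.1, factor 2, count 10. The helper exponentialBuckets is written here for illustration; client_golang offers prometheus.ExponentialBuckets(start, factor, count) with the same shape, and whether the Engine calls it directly is an assumption.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds: start, start*factor,
// start*factor^2, and so on. Illustrative helper, not the Engine's code.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Pellet compress/extract layout from the card.
	fmt.Println(exponentialBuckets(0.1, 2, 10))
}
```

The last bound is 0.1 × 2⁹ = 51.2 s, so a compress or extract that takes longer than that is only counted in the +Inf bucket.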

request_duration_seconds is a Summary

nestr_request_duration_seconds is a Prometheus Summary with quantiles {0.5, 0.95, 0.99}. Summary quantiles are computed inside each instance and cannot be meaningfully aggregated across instances in PromQL (only the _sum and _count series can). If the Engine is ever horizontally scaled, this metric should migrate to a Histogram. The orchestrator namespace already uses Histograms for that reason.
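A toy calculation shows why per-instance quantiles do not combine. The sketch below uses invented data (one busy replica with fast requests, one nearly idle replica with slow ones) and a plain nearest-rank quantile; none of the names come from the Engine.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the nearest-rank q-quantile of samples, for 0 < q <= 1.
func quantile(samples []float64, q float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(q * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// repeat builds n samples that all have the value v.
func repeat(v float64, n int) []float64 {
	out := make([]float64, n)
	for i := range out {
		out[i] = v
	}
	return out
}

// compareP99 returns the average of two per-replica p99 values and the
// p99 of the merged samples. The data is made up for illustration.
func compareP99() (averaged, merged float64) {
	busy := repeat(0.1, 1000) // 1000 requests at 0.1 s
	idle := repeat(2.0, 10)   // 10 requests at 2.0 s
	averaged = (quantile(busy, 0.99) + quantile(idle, 0.99)) / 2
	merged = quantile(append(busy, idle...), 0.99)
	return averaged, merged
}

func main() {
	averaged, merged := compareP99()
	fmt.Printf("avg of per-replica p99: %.2f s, p99 of all requests: %.2f s\n", averaged, merged)
}
```

Averaging the two replicas reports about 1.05 s, while the p99 over all 1010 requests is 0.1 s. Histogram buckets avoid this because bucket counts can be summed before the quantile is estimated.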

Key Terms

  • Summary → client-side quantile estimator; cheap per instance, not aggregable across instances.
  • Histogram → bucketed counts; aggregable via histogram_quantile() across replicas.
  • Label cardinality → orchestrator_metrics uses high-cardinality labels (operation_id); reserve those for gauges that drop quickly, not counters that accumulate forever.
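The cardinality point can be illustrated with a map standing in for a gauge vector keyed by operation_id. This is a sketch under assumptions: statusGauge and its methods are hypothetical, and the idea that the Engine deletes the series when an operation ends is inferred from "drop quickly", not confirmed by the source. With client_golang, the deletion step would correspond to something like GaugeVec.DeleteLabelValues.

```go
package main

import (
	"fmt"
	"sync"
)

// statusGauge is a hypothetical stand-in for a gauge vector labelled by
// operation_id. Each map entry represents one exported time series.
type statusGauge struct {
	mu     sync.Mutex
	series map[string]float64
}

func newStatusGauge() *statusGauge {
	return &statusGauge{series: make(map[string]float64)}
}

// start creates the series for an operation and marks it as running (1).
func (g *statusGauge) start(operationID string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.series[operationID] = 1
}

// finish removes the series so it is no longer exported.
func (g *statusGauge) finish(operationID string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.series, operationID)
}

// seriesCount reports how many series are currently live.
func (g *statusGauge) seriesCount() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.series)
}

func main() {
	g := newStatusGauge()
	g.start("op-1")
	g.start("op-2")
	g.start("op-3")
	g.finish("op-1")
	g.finish("op-2")
	fmt.Println("live series:", g.seriesCount())
}
```

The number of live series tracks the number of in-flight operations. A counter cannot be cleaned up this way without losing its history, which is why an operation_id label on a counter grows without bound.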

Q&A

Q: Why does orchestrator_operation_status carry operation_id while the counters don’t? A: It is a gauge that toggles per operation lifetime; cardinality is bounded by concurrency. Counters with operation_id would explode the TSDB.

Q: How is nestr_cache_hit_ratio kept in sync with the counters? A: The store recomputes it on every hit/miss path and Set()s the gauge. Counters and gauge can briefly disagree under contention but converge after each access.

Q: What’s the practical difference for an operator if a metric is in nestr_ vs orchestrator_? A: nestr_* answers “how is this single Engine doing”; orchestrator_* answers “how are operations across services and environments doing”. Different dashboards, different alerts.

Examples

99th-percentile (p99) compress time over a 5-minute window: histogram_quantile(0.99, sum(rate(nestr_pellet_compress_duration_seconds_bucket[5m])) by (le)). Bridge federates the same series outward (see nestr-009).
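For intuition about what that query returns, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. It is a simplified model of how histogram_quantile() treats classic histograms, not Prometheus's implementation, and the bucket counts are invented.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket.
type bucket struct {
	upperBound float64 // the le label; math.Inf(1) for +Inf
	cumCount   float64 // observations less than or equal to upperBound
}

// bucketQuantile estimates the q-quantile (0 < q <= 1) from buckets sorted
// by upperBound, with the +Inf bucket last. Simplified for illustration.
func bucketQuantile(q float64, buckets []bucket) float64 {
	if q <= 0 || q > 1 || len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].cumCount
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// Find the first bucket whose cumulative count reaches the rank.
	i := 0
	for i < len(buckets)-1 && buckets[i].cumCount < rank {
		i++
	}
	if i == len(buckets)-1 {
		// Rank falls in +Inf: report the highest finite bound.
		return buckets[len(buckets)-2].upperBound
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower = buckets[i-1].upperBound
		below = buckets[i-1].cumCount
	}
	inBucket := buckets[i].cumCount - below
	return lower + (buckets[i].upperBound-lower)*(rank-below)/inBucket
}

// exampleBuckets returns made-up counts on the first few compress bounds.
func exampleBuckets() []bucket {
	return []bucket{
		{0.1, 0}, {0.2, 50}, {0.4, 90}, {0.8, 100}, {math.Inf(1), 100},
	}
}

func main() {
	fmt.Printf("estimated p99: %.2f s\n", bucketQuantile(0.99, exampleBuckets()))
}
```

With these counts the target rank is 99 of 100, which falls in the (0.4, 0.8] bucket, giving an estimate of about 0.76 s. The estimate can never be finer than the bucket layout, so wide buckets at the top of the range blur the tail.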
