CRUMB a card from devarno-cloud

FNP Observability & Prometheus Metrics


ELI5

Observability is like a car dashboard: you can see speed (latency), fuel level (CPU), temperature (memory), and warning lights (errors). Prometheus is the tool that collects these metrics from FNP every 15 seconds. Grafana displays them on dashboards. Jaeger traces individual requests.

Technical Deep Dive

Metrics Exported (50+)

| Category     | Metric                             | Type      | Example                       |
|--------------|------------------------------------|-----------|-------------------------------|
| Operations   | fnp_insert_latency                 | Histogram | P50=50ms, P99=200ms           |
| Operations   | fnp_delete_count                   | Counter   | 1,234,567 total               |
| Cryptography | kyber_encapsulate_latency          | Histogram | P50=2ms                       |
| Cryptography | m2ore_compare_latency              | Histogram | P50=0.5ms                     |
| Cryptography | halo2_proof_verify_latency         | Histogram | P50=5ms, P99=15ms             |
| Replication  | crdt_merge_latency                 | Histogram | P50=1ms                       |
| Network      | peer_replication_lag               | Gauge     | ~5 seconds                    |
| Database     | postgres_connection_pool_available | Gauge     | 8/10 available                |
| Errors       | fnp_operation_rejected_total       | Counter   | 42 rejected (invalid proofs)  |

Histogram Percentiles

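Prometheus histograms track cumulative bucket counts, not quantiles: each bucket counts the observations at or below its upper bound (le), and percentiles are estimated from those counts at query time: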

fnp_insert_latency_seconds_bucket{le="0.01"} = 100 # ≤10ms
fnp_insert_latency_seconds_bucket{le="0.05"} = 450 # ≤50ms
fnp_insert_latency_seconds_bucket{le="0.2"} = 980 # ≤200ms
fnp_insert_latency_seconds_bucket{le="+Inf"} = 1000 # Total
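Computed (linear interpolation within the matching bucket, as histogram_quantile does):
P50 (median) ≈ 64ms (only 450 / 1000 = 45% of observations are ≤ 50ms, so rank 500 falls in the 50ms to 200ms bucket: 50 + 150 × (500 − 450) / (980 − 450) ≈ 64ms)
P99 is reported as 200ms (only 98% are ≤ 200ms; rank 990 falls in the +Inf bucket, so the estimate is capped at the highest finite bound)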
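
The bucket arithmetic can be checked with a small sketch. It assumes linear interpolation inside the matching bucket, which is the approach PromQL's histogram_quantile takes for classic histograms. The function name estimate_quantile and the EXAMPLE_BUCKETS list are illustrative, not part of FNP or Prometheus.

```python
import math

# Cumulative (upper_bound_seconds, count) pairs copied from the example buckets.
EXAMPLE_BUCKETS = [(0.01, 100), (0.05, 450), (0.2, 980), (math.inf, 1000)]


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket, so the best
                # available answer is the highest finite upper bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


p50 = estimate_quantile(0.50, EXAMPLE_BUCKETS)
p99 = estimate_quantile(0.99, EXAMPLE_BUCKETS)
print(f"P50 ~ {p50 * 1000:.0f}ms, P99 ~ {p99 * 1000:.0f}ms")
```

With these buckets, only 45% of observations are at or below 50ms, so the median is interpolated inside the 50ms to 200ms bucket, and the P99 estimate is capped at 200ms because rank 990 falls in the +Inf bucket. Finer buckets near the SLO threshold would give tighter estimates.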

Grafana Dashboards (12+)

FNP Operations Dashboard:

  • Row 1: insert/delete latency (P50, P95, P99)
  • Row 2: operation success/reject rate
  • Row 3: CRDT merge latency
  • Row 4: Network replication lag

Cryptography Dashboard:

  • Kyber encapsulation timeline
  • M²-ORE comparison timing
  • Halo2 proof verification
  • Dilithium signature verification

Infrastructure Dashboard:

  • Pod count (target vs actual)
  • CPU/memory utilization
  • Disk I/O
  • Network bandwidth

Jaeger Distributed Tracing

Traces individual operations end-to-end:

flowchart TD
ROOT["GET /api/fnp/insert\ntrace_id=abc123\n──────────────────\ntotal: 65ms · 7 spans"]
ROOT --> LSEQ["LSEQ position allocation\n1ms"]
LSEQ --> KDF["KDF randomness generation"]
ROOT --> ORE["M²-ORE encryption\n3ms"]
ORE --> NTT["Polynomial multiplication (NTT)"]
ROOT --> KYB["Kyber encapsulation\n8ms"]
KYB --> LWE["Module-LWE sampling"]
ROOT --> HALO["Halo2 proof generation\n50ms"]
HALO --> EVAL["Constraint evaluation"]
HALO --> IPA["IPA prover"]
ROOT --> DIL["Dilithium signing\n2ms"]
DIL --> REJ["Rejection sampling"]
ROOT --> MERGE["Server merge\n1ms"]

The root span has six direct child spans, each with its own internal steps. The diagram shows parent/child structure rather than timing: the child durations (1 + 3 + 8 + 50 + 2 + 1 ms) add up to the 65ms total, which suggests the steps run back to back rather than concurrently. Halo2 proof generation is the dominant cost: at 50ms, its two sub-steps (constraint evaluation and the IPA prover) account for roughly 77% of the request (50 / 65), making it the obvious optimisation target.
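
The share of the request budget taken by each child span can be worked out directly from the durations in the trace. This is a minimal sketch: the dictionary is transcribed by hand from the diagram, and cost_shares is a hypothetical helper, not a Jaeger API.

```python
# Child span durations in milliseconds, transcribed from the trace diagram.
CHILD_SPANS_MS = {
    "LSEQ position allocation": 1,
    "M2-ORE encryption": 3,
    "Kyber encapsulation": 8,
    "Halo2 proof generation": 50,
    "Dilithium signing": 2,
    "Server merge": 1,
}


def cost_shares(spans_ms):
    """Return (name, duration_ms, percent_of_total) tuples, slowest first."""
    total = sum(spans_ms.values())
    ranked = sorted(spans_ms.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, 100.0 * ms / total) for name, ms in ranked]


shares = cost_shares(CHILD_SPANS_MS)
for name, ms, pct in shares:
    print(f"{name:<28} {ms:>3}ms  {pct:5.1f}%")
```

Sorting by duration puts Halo2 proof generation first at about 77% of the total, with Kyber encapsulation a distant second at about 12%.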

Alerting Rules

High latency alert:

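alert: FNPHighInsertLatency
# P99 over the last 5 minutes, in seconds (0.5 = 500ms)
expr: histogram_quantile(0.99, sum by (le) (rate(fnp_insert_latency_seconds_bucket[5m]))) > 0.5
for: 5m
labels:
  severity: page  # page on-call engineer (routing assumed to be handled by Alertmanager)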

Error spike alert:

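alert: FNPHighErrorRate
# rate() is per second: fires above 0.05 rejected operations per second
expr: rate(fnp_operation_rejected_total[5m]) > 0.05
for: 2m
labels:
  severity: page  # page on-call engineer (routing assumed to be handled by Alertmanager)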

SLA violation alert:

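alert: FNPSLAViolation
# Assumes fnp_insert_latency_p99 (seconds) and fnp_error_rate (ratio, 0 to 1) are recording rules
expr: (fnp_insert_latency_p99 > 0.2) or (fnp_error_rate > 0.001)
for: 15m
labels:
  severity: critical  # critical page (SLA breach)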
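
The for: field in each rule means the condition must hold continuously for that long before the alert fires; a single bad evaluation only puts the alert into a pending state. The following is a simplified sketch of that state machine, not Prometheus source code, and the PendingAlert class is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Tracks one alert rule across evaluations, mimicking the `for:` clause."""

    for_seconds: float
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, condition: bool, now: float) -> str:
        if not condition:
            # Any evaluation where the condition is false resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# FNPHighInsertLatency uses `for: 5m`, i.e. 300 seconds.
rule = PendingAlert(for_seconds=300)
states = [
    rule.evaluate(True, now=0),     # condition first seen
    rule.evaluate(True, now=150),   # still within the 5m window
    rule.evaluate(True, now=300),   # held for the full 5m
    rule.evaluate(False, now=315),  # recovered, timer resets
    rule.evaluate(True, now=330),   # starts over
]
print(states)
```

A longer for: window trades detection speed for fewer false pages, which is why the SLA rule waits 15 minutes while the error spike rule waits only 2.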

Key Terms

  • Histogram → Metric type tracking distribution (P50, P95, P99)
  • Gauge → Instantaneous value (current CPU %)
  • Counter → Monotonically increasing (total operations)
  • Span → Single operation in a trace
  • SLO (Service Level Objective) → Target: P99 latency < 200ms, error rate < 0.1%

Q&A

Q: How often does Prometheus scrape? A: Every 15 seconds (configurable). FNP uses a 15s interval as a balance between data freshness and storage cost.

Q: Can tracing slow down the system? A: Yes, roughly 5% overhead on each traced request. FNP samples 1% of requests, so 99% carry no tracing overhead, but requests that end in an error are always traced.

Q: What’s the retention for Prometheus data? A: 15 days (older data is deleted). Downsampled long-term data (monthly aggregates) is kept for 1 year and viewed through Grafana.
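
The sampling policy from the tracing answer (keep 1% of requests, always keep errored ones) could be sketched as below. This is an assumption-laden illustration: should_trace is a hypothetical function, the decision is assumed to be made once the error status is known, and hashing the trace ID is one possible way to make the choice deterministic, not a description of how FNP or Jaeger implements it.

```python
import hashlib

SAMPLE_RATE = 0.01  # 1% of requests, per the tracing answer


def should_trace(trace_id: str, had_error: bool, sample_rate: float = SAMPLE_RATE) -> bool:
    """Decide whether to keep a trace: always on error, otherwise ~sample_rate."""
    if had_error:
        return True
    # Hash the trace ID to a number in [0, 1) so every service that sees the
    # same trace ID reaches the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


kept = sum(should_trace(f"trace-{i}", had_error=False) for i in range(100_000))
print(f"kept {kept} of 100000 successful requests")
```

Taking the figures in the answer at face value and ignoring the always-traced errors, the average overhead across all requests is about 0.01 × 5% = 0.05%.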

Examples

Observability is like a hospital monitor: vital signs (metrics) update every 15 seconds, alerts trigger if heart rate is too high (latency spike), and detailed charts (dashboards) show the patient’s condition over hours/days.
