FNP Observability & Prometheus Metrics
ELI5
Observability is like a car dashboard: you can see speed (latency), fuel level (CPU), temperature (memory), and warning lights (errors). Prometheus is the tool that collects these metrics from FNP every 15 seconds. Grafana displays them on dashboards. Jaeger traces individual requests.
Technical Deep Dive
Metrics Exported (50+)
| Category | Metric | Type | Example |
|---|---|---|---|
| Operations | fnp_insert_latency | Histogram | P50=50ms, P99=200ms |
| Operations | fnp_delete_count | Counter | 1,234,567 total |
| Cryptography | kyber_encapsulate_latency | Histogram | P50=2ms |
| Cryptography | m2ore_compare_latency | Histogram | P50=0.5ms |
| Cryptography | halo2_proof_verify_latency | Histogram | P50=5ms, P99=15ms |
| Replication | crdt_merge_latency | Histogram | P50=1ms |
| Network | peer_replication_lag | Gauge | ~5 seconds |
| Database | postgres_connection_pool_available | Gauge | 8/10 available |
| Errors | fnp_operation_rejected_total | Counter | 42 rejected (invalid proofs) |
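To make the three metric types in the table concrete, here is a minimal pure-Python sketch of their semantics. The classes are illustrative stand-ins written for this example, not the Prometheus client library, and the bucket bounds and sample values are assumptions chosen for illustration.

```python
import math

class Counter:
    """Monotonically increasing total, e.g. fnp_delete_count."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    """Instantaneous value that can go up or down, e.g. peer_replication_lag."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

class Histogram:
    """Cumulative bucket counts plus a running sum and count of observations."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        # Cumulative: increment every bucket whose upper bound covers the value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

deleted = Counter()
deleted.inc()
deleted.inc(3)

lag = Gauge()
lag.set(5.0)
lag.set(4.2)  # gauges may decrease

insert_latency = Histogram([0.01, 0.05, 0.2])  # hypothetical bounds, in seconds
for seconds in (0.004, 0.03, 0.03, 0.12, 0.7):
    insert_latency.observe(seconds)

for bound, count in zip(insert_latency.bounds, insert_latency.bucket_counts):
    label = "+Inf" if math.isinf(bound) else str(bound)
    print(f'fnp_insert_latency_seconds_bucket{{le="{label}"}} {count}')
```

The counter only ever grows, the gauge simply holds the latest value, and each histogram observation lands in every bucket at or above it, which is why the +Inf bucket always equals the total count.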
Histogram Percentiles
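Prometheus histograms store cumulative bucket counts, not quantiles: each le ("less than or equal") bucket counts the observations at or below its upper bound. Percentiles are estimated from these counts at query time: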
fnp_insert_latency_seconds_bucket{le="0.01"} = 100 # ≤10ms
fnp_insert_latency_seconds_bucket{le="0.05"} = 450 # ≤50ms
fnp_insert_latency_seconds_bucket{le="0.2"} = 980 # ≤200ms
fnp_insert_latency_seconds_bucket{le="+Inf"} = 1000 # Total
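A quantile can be estimated from cumulative buckets like these by finding the bucket that contains the target rank and interpolating linearly inside it, which is the approach PromQL's histogram_quantile takes. The sketch below is a simplified re-implementation for illustration; the function name estimate_quantile is hypothetical, and it assumes the lowest bucket starts at 0.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by bound and end with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # assume the lowest bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            # Quantile falls in +Inf (or an empty bucket): report the last
            # finite bound, since nothing more precise is known.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

buckets = [(0.01, 100), (0.05, 450), (0.2, 980), (math.inf, 1000)]
p50 = estimate_quantile(0.50, buckets)
p99 = estimate_quantile(0.99, buckets)
print(f"P50 ~ {p50 * 1000:.0f}ms, P99 reported as {p99 * 1000:.0f}ms")
```

Because the estimate depends on where the bucket boundaries sit, the bounds are usually chosen close to the latency targets that matter (here 50ms and 200ms).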
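Computed:
- P50 (median) ≈ 64ms: only 45% of observations (450 / 1000) are ≤ 50ms, so the median falls in the 50ms to 200ms bucket. Linear interpolation, as used by histogram_quantile, gives 50ms + 150ms × (500 − 450) / (980 − 450) ≈ 64ms.
- P99 > 200ms: only 98% of observations (980 / 1000) are ≤ 200ms, so the 99th percentile falls in the +Inf bucket. In that case histogram_quantile reports the highest finite bound, 200ms.
Grafana Dashboards (12+)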
FNP Operations Dashboard:
- Row 1: insert/delete latency (P50, P95, P99)
- Row 2: operation success/reject rate
- Row 3: CRDT merge latency
- Row 4: Network replication lag
Cryptography Dashboard:
- Kyber encapsulation timeline
- M²-ORE comparison timing
- Halo2 proof verification
- Dilithium signature verification
Infrastructure Dashboard:
- Pod count (target vs actual)
- CPU/memory utilization
- Disk I/O
- Network bandwidth
Jaeger Distributed Tracing
Traces individual operations end-to-end:
flowchart TD
ROOT["GET /api/fnp/insert\ntrace_id=abc123\n──────────────────\ntotal: 65ms · 7 spans"]
ROOT --> LSEQ["LSEQ position allocation\n1ms"]
LSEQ --> KDF["KDF randomness generation"]
ROOT --> ORE["M²-ORE encryption\n3ms"]
ORE --> NTT["Polynomial multiplication (NTT)"]
ROOT --> KYB["Kyber encapsulation\n8ms"]
KYB --> LWE["Module-LWE sampling"]
ROOT --> HALO["Halo2 proof generation\n50ms"]
HALO --> EVAL["Constraint evaluation"]
HALO --> IPA["IPA prover"]
ROOT --> DIL["Dilithium signing\n2ms"]
DIL --> REJ["Rejection sampling"]
ROOT --> MERGE["Server merge\n1ms"]
The root span is the parent of all six child operations, which is how Jaeger records them: as sub-spans within one trace. The six child durations (1 + 3 + 8 + 50 + 2 + 1) add up to the 65ms total, which suggests the steps run back to back, so the fan-out shows parent/child structure rather than concurrency. Halo2 proof generation is the dominant cost: its 50ms span, made up of constraint evaluation and the IPA prover, accounts for roughly 77% of the request budget, making it the obvious optimisation target.
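The per-span figures in the trace above can be tallied to check the arithmetic and rank the spans by cost. This is a small sketch using only the durations shown in the diagram; the dictionary is a hand-copied stand-in for trace data, not output from the Jaeger API.

```python
# Child span durations in milliseconds, copied from the trace diagram.
spans_ms = {
    "LSEQ position allocation": 1,
    "M²-ORE encryption": 3,
    "Kyber encapsulation": 8,
    "Halo2 proof generation": 50,
    "Dilithium signing": 2,
    "Server merge": 1,
}

total_ms = sum(spans_ms.values())
shares = {name: ms / total_ms for name, ms in spans_ms.items()}
slowest = max(spans_ms, key=spans_ms.get)

# Print spans from most to least expensive with their share of the total.
for name, ms in sorted(spans_ms.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {ms:3d}ms {shares[name]:6.1%}")
print(f"total: {total_ms}ms, dominant span: {slowest}")
```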
Alerting Rules
High latency alert:
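alert: FNPHighInsertLatency
expr: histogram_quantile(0.99, sum by (le) (rate(fnp_insert_latency_seconds_bucket[5m]))) > 0.5
for: 5m
labels: { severity: page }
Error spike alert:
alert: FNPHighErrorRate
expr: rate(fnp_operation_rejected_total[5m]) > 0.05
for: 2m
labels: { severity: page }
SLA violation alert:
alert: FNPSLAViolation
expr: (fnp_insert_latency_p99 > 0.2) or (fnp_error_rate > 0.001)
for: 15m
labels: { severity: critical }
Thresholds are written in seconds (0.5 = 500ms, 0.2 = 200ms) because the histogram is exported as fnp_insert_latency_seconds, and 0.001 is the 0.1% error rate expressed as a ratio. fnp_insert_latency_p99 and fnp_error_rate are assumed to be recording rules, since they are not among the raw metrics listed above. Alertmanager is assumed to route severity: page to the on-call engineer and severity: critical to a critical page (SLA breach).
Key Terms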
- Histogram → Metric type tracking distribution (P50, P95, P99)
- Gauge → Instantaneous value (current CPU %)
- Counter → Monotonically increasing (total operations)
- Span → Single operation in a trace
- SLO (Service Level Objective) → Target: P99 latency < 200ms, error rate < 0.1%
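As a worked example of the SLO definition above, the sketch below checks a measurement window against the two targets and reports how much error budget remains. The function name, the field names, and the sample figures are hypothetical; only the two thresholds come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SLO:
    p99_latency_s: float = 0.2     # target from the text: P99 latency < 200ms
    max_error_rate: float = 0.001  # target from the text: error rate < 0.1%

def check_slo(p99_latency_s, errors, total_requests, slo=SLO()):
    """Compare one measurement window against the SLO targets."""
    error_rate = errors / total_requests if total_requests else 0.0
    return {
        "latency_ok": p99_latency_s < slo.p99_latency_s,
        "errors_ok": error_rate < slo.max_error_rate,
        # Number of further failed requests the window could absorb.
        "error_budget_left": slo.max_error_rate * total_requests - errors,
    }

# Hypothetical window: P99 of 180ms, 42 rejected out of 100,000 requests.
report = check_slo(p99_latency_s=0.18, errors=42, total_requests=100_000)
print(report)
```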
Q&A
Q: How often does Prometheus scrape? A: Every 15 seconds in FNP's setup. The interval is configurable; 15s is a balance between metric freshness and storage cost.
Q: Can tracing slow down the system? A: Yes, tracing adds roughly 5% overhead to each traced request. FNP samples 1% of requests, so 99% carry no tracing overhead, but requests that end in an error are always traced.
Q: What’s the retention for Prometheus data? A: 15 days; older data is deleted. Downsampled long-term data (monthly aggregates) is kept for 1 year and remains available in Grafana.
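The sampling policy from the tracing answer above (keep 1% of requests, always keep failures) can be sketched as a deterministic decision on the trace ID. This is an illustration under stated assumptions, not Jaeger's sampler API: the function names are hypothetical, and keeping every failed request implies the keep/drop decision is made after the outcome is known.

```python
import hashlib

SAMPLE_RATE = 0.01     # 1% of requests, per the text
TRACE_OVERHEAD = 0.05  # ~5% overhead on a traced request, per the text

def head_sampled(trace_id, rate=SAMPLE_RATE):
    """Hash the trace ID to a number in [0, 1) and keep it if below the rate."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < rate

def keep_trace(trace_id, had_error, rate=SAMPLE_RATE):
    """Failed requests are always kept; the rest are sampled at `rate`."""
    return had_error or head_sampled(trace_id, rate)

# Average overhead across all requests, ignoring the always-traced failures.
average_overhead = SAMPLE_RATE * TRACE_OVERHEAD

kept = sum(head_sampled(f"trace-{i}") for i in range(100_000))
print(f"kept {kept} of 100000 traces; average overhead {average_overhead:.2%}")
```

Hashing the trace ID, rather than drawing a random number per service, means every service that sees the same trace reaches the same decision.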
Examples
Observability is like a hospital monitor: vital signs (metrics) update every 15 seconds, alerts trigger if heart rate is too high (latency spike), and detailed charts (dashboards) show the patient’s condition over hours/days.