FNP Observability & Prometheus Metrics

fnp intermediate 6 min read

ELI5

Observability is like a car dashboard: you can see speed (latency), fuel level (CPU), temperature (memory), and warning lights (errors). Prometheus is the tool that collects these metrics from FNP every 15 seconds. Grafana displays them on dashboards. Jaeger traces individual requests.

Technical Deep Dive

Metrics Exported (50+)

Category	Metric	Type	Example
Operations	`fnp_insert_latency`	Histogram	P50=50ms, P99=200ms
Operations	`fnp_delete_count`	Counter	1,234,567 total
Cryptography	`kyber_encapsulate_latency`	Histogram	P50=2ms
Cryptography	`m2ore_compare_latency`	Histogram	P50=0.5ms
Cryptography	`halo2_proof_verify_latency`	Histogram	P50=5ms, P99=15ms
Replication	`crdt_merge_latency`	Histogram	P50=1ms
Network	`peer_replication_lag`	Gauge	~5 seconds
Database	`postgres_connection_pool_available`	Gauge	8/10 available
Errors	`fnp_operation_rejected_total`	Counter	42 rejected (invalid proofs)

Histogram Percentiles

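Prometheus histograms record cumulative bucket counts: each bucket counts the observations at or below its `le` upper bound, and percentiles are estimated from those counts: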

fnp_insert_latency_seconds_bucket{le="0.01"} = 100    # ≤10ms
fnp_insert_latency_seconds_bucket{le="0.05"} = 450    # ≤50ms
fnp_insert_latency_seconds_bucket{le="0.2"} = 980     # ≤200ms
fnp_insert_latency_seconds_bucket{le="+Inf"} = 1000   # Total

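Computed (linear interpolation inside the bucket, as `histogram_quantile` does):
P50 (median) ≈ 64ms  (450 / 1000 puts 50ms at the 45th percentile, so rank 500 falls in the 50ms to 200ms bucket)
P99 ≈ 200ms  (only 98% of samples are ≤ 200ms; rank 990 lies in the +Inf bucket, so the estimate is capped at the highest finite bound)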
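
The percentile estimates can be reproduced by hand from the four example buckets. The sketch below mimics the linear interpolation that PromQL's `histogram_quantile` applies within a bucket. The function name `estimate_quantile` and the `BUCKETS` layout are illustrative assumptions, not part of any FNP or Prometheus API.

```python
import math

# Cumulative bucket counts from the example: (upper bound in seconds, count).
BUCKETS = [(0.01, 100), (0.05, 450), (0.2, 980), (math.inf, 1000)]

def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative buckets by linear interpolation."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank lies beyond the last finite bucket: report that bucket's bound.
                return prev_bound
            # Interpolate within (prev_bound, bound].
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

p50 = estimate_quantile(0.50, BUCKETS)
p99 = estimate_quantile(0.99, BUCKETS)
print(f"P50 ~ {p50 * 1000:.1f} ms")
print(f"P99 ~ {p99 * 1000:.1f} ms")
```

With these buckets the median lands inside the 50ms to 200ms bucket rather than at its lower edge, and the P99 estimate cannot exceed 200ms because no finite bucket sits above it. Adding a larger finite bucket would make the tail estimate more informative.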

Grafana Dashboards (12+)

FNP Operations Dashboard:

Row 1: insert/delete latency (P50, P95, P99)
Row 2: operation success/reject rate
Row 3: CRDT merge latency
Row 4: Network replication lag

Cryptography Dashboard:

Kyber encapsulation timeline
M²-ORE comparison timing
Halo2 proof verification
Dilithium signature verification

Infrastructure Dashboard:

Pod count (target vs actual)
CPU/memory utilization
Disk I/O
Network bandwidth

Jaeger Distributed Tracing

Traces individual operations end-to-end:

flowchart TD
    ROOT["GET /api/fnp/insert\ntrace_id=abc123\n──────────────────\ntotal: 65ms · 7 spans"]

    ROOT --> LSEQ["LSEQ position allocation\n1ms"]
    LSEQ --> KDF["KDF randomness generation"]

    ROOT --> ORE["M²-ORE encryption\n3ms"]
    ORE --> NTT["Polynomial multiplication (NTT)"]

    ROOT --> KYB["Kyber encapsulation\n8ms"]
    KYB --> LWE["Module-LWE sampling"]

    ROOT --> HALO["Halo2 proof generation\n50ms"]
    HALO --> EVAL["Constraint evaluation"]
    HALO --> IPA["IPA prover"]

    ROOT --> DIL["Dilithium signing\n2ms"]
    DIL --> REJ["Rejection sampling"]

    ROOT --> MERGE["Server merge\n1ms"]

The six child operations are siblings under the root span, which is how Jaeger records them: as sub-spans within one trace rather than as a nested chain. Their durations add up to exactly the 65ms total, which indicates the steps run back to back rather than overlapping. Halo2 proof generation dominates at 50ms: its two children (constraint evaluation and IPA prover) account for roughly 77% of the request (50 of 65ms), making it the obvious optimisation target.
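
The 77% figure follows directly from the span durations in the trace. A short sketch of that arithmetic, with the durations copied from the diagram (the dictionary layout and variable names are only for illustration):

```python
# Durations in milliseconds of the six child spans shown in the trace.
spans = {
    "LSEQ position allocation": 1,
    "M2-ORE encryption": 3,
    "Kyber encapsulation": 8,
    "Halo2 proof generation": 50,
    "Dilithium signing": 2,
    "Server merge": 1,
}

total_ms = sum(spans.values())
shares = {name: ms / total_ms for name, ms in spans.items()}
slowest = max(spans, key=spans.get)

# Print spans from slowest to fastest with their share of the request.
for name, ms in sorted(spans.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<28}{ms:>4} ms {shares[name]:6.1%}")
```

Sorting spans by share in this way is a quick check on where optimisation effort is likely to pay off before opening the full trace view.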

Alerting Rules

High latency alert:

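alert: FNPHighInsertLatency
expr: histogram_quantile(0.99, sum by (le) (rate(fnp_insert_latency_seconds_bucket[5m]))) > 0.5   # seconds
for: 5m
labels:
  severity: page   # page on-call engineer (routing by severity label is assumed)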

Error spike alert:

alert: FNPHighErrorRate
expr: rate(fnp_operation_rejected_total[5m]) > 0.05   # rejections per second
for: 2m
labels:
  severity: page   # page on-call engineer (routing by severity label is assumed)
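
`rate()` over a counter yields the average per-second increase across the window. A simplified sketch of that calculation, assuming two hypothetical counter samples five minutes apart and ignoring the extrapolation and counter-reset handling that Prometheus applies (`simple_rate` is an illustrative name, not a Prometheus function):

```python
def simple_rate(first, last):
    """Per-second increase between two (timestamp_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)

# Hypothetical samples of the rejected-operations counter, 300 s apart.
start = (0, 42)
end = (300, 60)

rejected_per_second = simple_rate(start, end)
THRESHOLD = 0.05  # same threshold as the alert expression

print(rejected_per_second, rejected_per_second > THRESHOLD)
```

Note that the threshold is in rejections per second (0.05/s is three per minute), not a percentage of all operations. Expressing it as a ratio would require dividing by a counter of total operations.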

SLA violation alert:

alert: FNPSLAViolation
expr: (fnp_insert_latency_p99 > 0.2) or (fnp_error_rate > 0.001)   # assumes recording rules in seconds and as a ratio
for: 15m
labels:
  severity: critical   # critical page (SLA breach)
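
The `for:` clause means a rule fires only once its expression has been true at every evaluation for the whole duration; until then the alert is pending. A minimal sketch of that behaviour, assuming one rule evaluation every 15 seconds (an assumption; only the scrape interval is stated here) and a hypothetical `alert_states` helper:

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation."""
    states = []
    active_since = None  # evaluation time at which the condition first became true
    for i, is_true in enumerate(condition_results):
        now = i * eval_interval_s
        if not is_true:
            # A single false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states

# Condition true for 8 evaluations, one false result, then true for 10 more.
results = [True] * 8 + [False] + [True] * 10
states = alert_states(results, eval_interval_s=15, for_s=120)
print(states)
```

The first run of true results never fires because it is interrupted before two minutes elapse, which is the point of `for:`: short spikes are filtered out at the cost of a delayed page.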

Key Terms

Histogram → Metric type that counts observations in cumulative buckets, from which percentiles (P50, P95, P99) are estimated
Gauge → Instantaneous value (current CPU %)
Counter → Monotonically increasing (total operations)
Span → Single operation in a trace
SLO (Service Level Objective) → Target: P99 latency < 200ms, error rate < 0.1%
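
The three metric types differ mainly in how their value is allowed to change. The toy classes below illustrate those semantics only. They are not the Prometheus client library API, and the bucket bounds are borrowed from the histogram example as an assumption.

```python
import bisect

class Counter:
    """Monotonically increasing total."""
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    """Instantaneous value that can go up or down."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations in cumulative buckets keyed by upper bound."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [float("inf")]
        self.counts = [0] * len(self.bounds)

    def observe(self, value):
        # Increment every bucket whose upper bound is >= value.
        first = bisect.bisect_left(self.bounds, value)
        for i in range(first, len(self.bounds)):
            self.counts[i] += 1

latency = Histogram([0.01, 0.05, 0.2])
for seconds in (0.005, 0.03, 0.03, 0.1, 0.7):
    latency.observe(seconds)
print(list(zip(latency.bounds, latency.counts)))
```

Because the buckets are cumulative, the last (`+Inf`) count always equals the total number of observations.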

Q&A

Q: How often does Prometheus scrape? A: Every 15 seconds in FNP (the interval is configurable). 15s balances metric freshness against storage cost.

Q: Can tracing slow down the system? A: Yes, by roughly 5% on each traced request. FNP samples 1% of requests, so 99% carry no tracing overhead, but requests that hit errors are always traced.
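
The sampling policy can be sketched as a small decision function. This is an illustration under stated assumptions, not Jaeger's API: `should_trace` is a hypothetical helper, the keep/drop decision is assumed to be made once the request outcome is known, and the 1% slice is chosen by hashing the trace ID so the result is deterministic.

```python
import hashlib

SAMPLE_RATE = 0.01      # 1% of requests, as stated in the answer
TRACE_OVERHEAD = 0.05   # ~5% extra cost on a traced request

def should_trace(trace_id: str, had_error: bool) -> bool:
    """Keep every errored request; otherwise keep a deterministic ~1% slice."""
    if had_error:
        return True
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < SAMPLE_RATE

# Average overhead across error-free traffic: 1% of requests pay ~5%.
average_overhead = SAMPLE_RATE * TRACE_OVERHEAD
print(f"{average_overhead:.2%}")

# Count how many of 100,000 synthetic error-free requests would be traced.
sampled = sum(should_trace(f"trace-{i}", had_error=False) for i in range(100_000))
print(sampled)
```

Averaged over error-free traffic, the cost works out to about 0.05% (1% of requests each paying about 5%), which is why sampling makes always-on tracing affordable.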

Q: What's the retention for Prometheus data? A: 15 days; older data is deleted. Downsampled long-term data (monthly aggregates) is kept for 1 year and viewed through Grafana, which is a visualization layer and reads that data from a separate store.

Examples

Observability is like a hospital monitor: vital signs (metrics) update every 15 seconds, alerts trigger if heart rate is too high (latency spike), and detailed charts (dashboards) show the patient’s condition over hours/days.
