Perch Bridge Federation
nestr intermediate 4 min read
ELI5
The Bridge is a delivery courier: every 30 seconds it visits the Engine’s metrics shop, picks up everything on the shelves, drops the haul at the Folio warehouse, and writes a postcard saying “made it, here’s how heavy the load was”.
Technical Deep Dive
perch/bridge/main.go is a small Go service whose job is one-way metric federation from the Engine into a remote /api/v1/write-compatible collector (“Folio”).
Configuration
| Env / flag | Default | Purpose |
|---|---|---|
| NestrMetricsURL | http://nestr-engine:9090/metrics | Source of nestr_* series |
| FolioURL | (required) | Destination remote-write endpoint |
| PushInterval | 30s | Tick rate |
| FederationEnabled | true | Master switch |
| ListenAddr | :8080 | Bridge’s own /metrics + /health |
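A minimal sketch of how the defaults in the table might be loaded. The table gives only the setting names, so the environment variable spellings (NESTR_METRICS_URL, FOLIO_URL, and so on), the Config struct, and the loadConfig helper are all assumptions made for illustration, not the Bridge's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"time"
)

// Config mirrors the settings table. Field names come from the table;
// the environment variable spellings used below are assumptions.
type Config struct {
	NestrMetricsURL   string
	FolioURL          string
	PushInterval      time.Duration
	FederationEnabled bool
	ListenAddr        string
}

// loadConfig reads settings through getenv (os.Getenv in a real service)
// and falls back to the documented default when a variable is empty.
func loadConfig(getenv func(string) string) (Config, error) {
	orDefault := func(key, def string) string {
		if v := getenv(key); v != "" {
			return v
		}
		return def
	}

	cfg := Config{
		NestrMetricsURL: orDefault("NESTR_METRICS_URL", "http://nestr-engine:9090/metrics"),
		FolioURL:        getenv("FOLIO_URL"), // no default: the table marks it required
		ListenAddr:      orDefault("LISTEN_ADDR", ":8080"),
	}

	var err error
	if cfg.PushInterval, err = time.ParseDuration(orDefault("PUSH_INTERVAL", "30s")); err != nil {
		return Config{}, fmt.Errorf("PUSH_INTERVAL: %w", err)
	}
	if cfg.PushInterval <= 0 {
		return Config{}, errors.New("PUSH_INTERVAL must be positive")
	}
	if cfg.FederationEnabled, err = strconv.ParseBool(orDefault("FEDERATION_ENABLED", "true")); err != nil {
		return Config{}, fmt.Errorf("FEDERATION_ENABLED: %w", err)
	}
	if cfg.FolioURL == "" {
		return Config{}, errors.New("FOLIO_URL is required")
	}
	return cfg, nil
}

func main() {
	// Demo with a fake environment; the Folio URL here is a placeholder.
	env := map[string]string{"FOLIO_URL": "https://folio.example/api/v1/write"}
	cfg, err := loadConfig(func(k string) string { return env[k] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function in, rather than calling os.Getenv directly, keeps the defaults easy to check without touching the process environment.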
Push Loop
    sequenceDiagram
        participant T as ticker(30s)
        participant B as Bridge
        participant E as Engine /metrics
        participant F as Folio /api/v1/write
        loop every PushInterval
            T->>B: tick
            B->>E: GET /metrics
            E-->>B: text-format families
            B->>B: parse → snappy(protobuf WriteRequest)
            B->>F: POST application/x-protobuf
            alt 2xx
                F-->>B: 200
                B->>B: pushes_total++, last_push_timestamp=now, metrics_federated.Set(n)
            else error
                F-->>B: 5xx
                B->>B: push_errors_total++
            end
        end

Bridge’s Own Metrics
The perch_bridge_* series let the central system observe the courier itself:
- perch_bridge_pushes_total: counter
- perch_bridge_push_errors_total: counter
- perch_bridge_push_duration_seconds: histogram, buckets [0.01, 0.05, 0.1, 0.5, 1, 2.5, 5, 10]
- perch_bridge_last_push_timestamp_seconds: gauge; alerting target if it stops incrementing
- perch_bridge_metrics_federated: gauge of how many series were in the last push
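As a rough sketch, one tick of the push loop could update these series as follows. Plain struct fields stand in for real Prometheus collectors, and the selfMetrics type, pushOnce function, errFolioDown value, and the scrape/push callbacks are hypothetical names. Counting a failed scrape as a push error is also an assumption; the source only describes the failed-push case.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errFolioDown simulates a 5xx answer from the remote-write endpoint.
var errFolioDown = errors.New("folio returned 503")

// selfMetrics stands in for the perch_bridge_* collectors; a real service
// would register them with a Prometheus client library instead.
type selfMetrics struct {
	PushesTotal      int       // perch_bridge_pushes_total
	PushErrorsTotal  int       // perch_bridge_push_errors_total
	LastPushUnix     float64   // perch_bridge_last_push_timestamp_seconds
	MetricsFederated int       // perch_bridge_metrics_federated
	PushDurations    []float64 // observations for perch_bridge_push_duration_seconds
}

// pushOnce runs one tick: scrape the source, push the payload, record the outcome.
// scrape returns the payload plus the number of series it contains;
// nowSec returns the current time in Unix seconds.
func pushOnce(m *selfMetrics, scrape func() ([]byte, int, error), push func([]byte) error, nowSec func() float64) error {
	payload, n, err := scrape()
	if err != nil {
		m.PushErrorsTotal++ // assumption: a failed scrape counts as a failed push
		return fmt.Errorf("scrape: %w", err)
	}

	start := nowSec()
	err = push(payload)
	// Only the push is timed, so a slow scrape does not inflate the histogram.
	m.PushDurations = append(m.PushDurations, nowSec()-start)
	if err != nil {
		m.PushErrorsTotal++ // nothing is buffered; the next tick scrapes fresh data
		return fmt.Errorf("push: %w", err)
	}

	m.PushesTotal++
	m.LastPushUnix = nowSec()
	m.MetricsFederated = n
	return nil
}

func main() {
	clock := func() float64 { return float64(time.Now().UnixNano()) / 1e9 }
	scrape := func() ([]byte, int, error) { return []byte("nestr_up 1\n"), 1, nil }

	m := &selfMetrics{}
	_ = pushOnce(m, scrape, func([]byte) error { return nil }, clock)
	_ = pushOnce(m, scrape, func([]byte) error { return errFolioDown }, clock)
	fmt.Printf("pushes=%d errors=%d federated=%d\n", m.PushesTotal, m.PushErrorsTotal, m.MetricsFederated)
}
```

In this sketch the failure path leaves the last-push timestamp untouched, which is what makes that gauge usable as a staleness signal.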
What does NOT pass through
The Bridge federates Prometheus metrics only. WebSocket events (nestr-007), structured logs, and traces have separate paths — Promtail/Loki for logs, Jaeger for traces (see nestr-010 for the trace-correlator).
Key Terms
- Federation → polling another Prometheus-compatible source and re-emitting the series, optionally with relabeling.
- Remote write → the protobuf-over-HTTP protocol Prometheus uses to push samples to a long-term store.
- Snappy → the compression Prometheus remote write requires; the request body uses Snappy's block format, not the framed (streaming) format.
Q&A
Q: What single alert detects a stuck Bridge?
A: time() - perch_bridge_last_push_timestamp_seconds > 90 — three missed 30 s ticks.
Q: Does the Bridge re-send series on failure?
A: No. A failed push only increments push_errors_total; the next tick re-scrapes fresh series. There is no buffer on disk.
Q: Why does the Bridge pull from /metrics and push to Folio, instead of Folio scraping the Engine directly?
A: Folio is outside the Engine’s network reach in the typical deployment. The Bridge sits inside the Engine’s cluster and is the only component allowed egress to Folio.
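The stuck-Bridge alert from the first question is simple arithmetic: with the default 30 s PushInterval, a 90 s threshold equals three intervals. A small sketch of that check, using the hypothetical helpers staleThresholdSeconds and isStuck:

```go
package main

import "fmt"

// staleThresholdSeconds returns how far the last-push timestamp may lag
// before alerting: missedTicks whole push intervals.
func staleThresholdSeconds(pushIntervalSeconds float64, missedTicks int) float64 {
	return pushIntervalSeconds * float64(missedTicks)
}

// isStuck mirrors the alert expression
//   time() - perch_bridge_last_push_timestamp_seconds > threshold
// with both timestamps given in Unix seconds.
func isStuck(nowUnix, lastPushUnix, thresholdSeconds float64) bool {
	return nowUnix-lastPushUnix > thresholdSeconds
}

func main() {
	threshold := staleThresholdSeconds(30, 3) // 90 s for the default interval

	fmt.Println(isStuck(1000, 940, threshold)) // last push 60 s ago
	fmt.Println(isStuck(1000, 905, threshold)) // last push 95 s ago
}
```

If PushInterval is changed from its default, the threshold in the alert expression would need to scale with it, otherwise the alert either fires on healthy ticks or reacts too late.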
Examples
A 1 s p99 federation time on a workspace with ~400 series is normal. A jump to >5 s on the histogram usually points at a slow remote-write endpoint, not at the Engine — Engine scrape duration would not show in perch_bridge_push_duration_seconds.
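To read those numbers against the bucket list of perch_bridge_push_duration_seconds, the sketch below maps a push duration to the lowest bucket bound that counts it. The smallestBucket helper is hypothetical; the bounds are the ones listed for the histogram.

```go
package main

import (
	"fmt"
	"math"
)

// pushDurationBuckets are the upper bounds listed for
// perch_bridge_push_duration_seconds.
var pushDurationBuckets = []float64{0.01, 0.05, 0.1, 0.5, 1, 2.5, 5, 10}

// smallestBucket returns the lowest "le" bound that counts an observation.
// Prometheus buckets are cumulative with inclusive upper bounds, so every
// larger bound counts it too; anything above 10 s is counted only by +Inf.
func smallestBucket(seconds float64) float64 {
	for _, le := range pushDurationBuckets {
		if seconds <= le {
			return le
		}
	}
	return math.Inf(1)
}

func main() {
	for _, d := range []float64{0.8, 1.0, 6.2, 12} {
		fmt.Printf("%.1fs -> le=%v\n", d, smallestBucket(d))
	}
}
```

A push of about 1 s is still counted by the le=1 bucket, while anything slower than 5 s shows up only in le=10 and +Inf, which is why a shift of weight into those last buckets stands out.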
neighbors on the map
- FNP Observability & Prometheus Metrics: monitoring FNP systems
- Deployment Topology & Proxy Conflict Resolution: setting up a new environment (kitten/cat/lion)
- LORE+CAIRNET Deployment Topology & Service Map: understanding the LORE deployment architecture