CRUMB a card from devarno-cloud

Perch Bridge Federation

ELI5

The Bridge is a delivery courier: every 30 seconds it visits the Engine’s metrics shop, picks up everything on the shelves, drops the haul at the Folio warehouse, and writes a postcard saying “made it, here’s how heavy the load was”.

Technical Deep Dive

perch/bridge/main.go is a small Go service whose job is one-way metric federation from the Engine into a remote /api/v1/write-compatible collector (“Folio”).

Configuration

| Env / flag | Default | Purpose |
| --- | --- | --- |
| NestrMetricsURL | http://nestr-engine:9090/metrics | Source of nestr_* series |
| FolioURL | (required) | Destination remote-write endpoint |
| PushInterval | 30s | Tick rate |
| FederationEnabled | true | Master switch |
| ListenAddr | :8080 | Bridge’s own /metrics + /health |
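
As a rough illustration of how these settings and defaults might be resolved at startup, here is a minimal Go sketch. The field names follow the configuration listing above; the environment variable names (NESTR_METRICS_URL, FOLIO_URL, PUSH_INTERVAL, FEDERATION_ENABLED, LISTEN_ADDR) and the loadConfig helper are assumptions made for this example, not taken from perch/bridge/main.go.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"time"
)

// Config mirrors the documented settings. The struct itself is hypothetical.
type Config struct {
	NestrMetricsURL   string
	FolioURL          string
	PushInterval      time.Duration
	FederationEnabled bool
	ListenAddr        string
}

// loadConfig reads settings through getenv (os.Getenv in a real service)
// and falls back to the documented defaults. Env var names are assumed.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		NestrMetricsURL:   "http://nestr-engine:9090/metrics",
		PushInterval:      30 * time.Second,
		FederationEnabled: true,
		ListenAddr:        ":8080",
	}
	if v := getenv("NESTR_METRICS_URL"); v != "" {
		cfg.NestrMetricsURL = v
	}
	if v := getenv("LISTEN_ADDR"); v != "" {
		cfg.ListenAddr = v
	}
	if v := getenv("PUSH_INTERVAL"); v != "" {
		d, err := time.ParseDuration(v)
		if err != nil {
			return Config{}, fmt.Errorf("PUSH_INTERVAL: %w", err)
		}
		if d <= 0 {
			return Config{}, errors.New("PUSH_INTERVAL must be positive")
		}
		cfg.PushInterval = d
	}
	if v := getenv("FEDERATION_ENABLED"); v != "" {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return Config{}, fmt.Errorf("FEDERATION_ENABLED: %w", err)
		}
		cfg.FederationEnabled = b
	}
	// FolioURL has no default: it is the one required setting.
	cfg.FolioURL = getenv("FOLIO_URL")
	if cfg.FolioURL == "" {
		return Config{}, errors.New("FOLIO_URL is required")
	}
	return cfg, nil
}

func main() {
	// A fake environment for demonstration; pass os.Getenv in a service.
	env := map[string]string{"FOLIO_URL": "http://folio.example/api/v1/write"}
	cfg, err := loadConfig(func(k string) string { return env[k] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Taking getenv as a parameter keeps the loader testable without touching the process environment.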

Push Loop

sequenceDiagram
    participant T as ticker(30s)
    participant B as Bridge
    participant E as Engine /metrics
    participant F as Folio /api/v1/write
    loop every PushInterval
        T->>B: tick
        B->>E: GET /metrics
        E-->>B: text-format families
        B->>B: parse → snappy(protobuf WriteRequest)
        B->>F: POST application/x-protobuf
        alt 2xx
            F-->>B: 200
            B->>B: pushes_total++, last_push_timestamp=now, metrics_federated.Set(n)
        else error
            F-->>B: 5xx
            B->>B: push_errors_total++
        end
    end
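
The control flow in the diagram can be sketched in Go with the standard library alone. This is a sketch under assumptions, not the Bridge's actual code: Stats, scrapeFunc, sendFunc, pushOnce and run are hypothetical names, and the scrape, parse and snappy/protobuf encoding steps are hidden behind the scrape callback. Counting a failed scrape as a push error is also an assumption, since the diagram only covers Folio failures.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// Stats stands in for the perch_bridge_* counters and gauges.
type Stats struct {
	PushesTotal       int
	PushErrorsTotal   int
	LastPushTimestamp float64 // Unix seconds of the last successful push
	MetricsFederated  int     // series count in the last successful push
}

// scrapeFunc fetches the Engine's /metrics and returns an encoded payload
// plus the number of series; sendFunc POSTs that payload to Folio.
type scrapeFunc func() (payload []byte, series int, err error)
type sendFunc func(payload []byte) error

var errFolioUnavailable = errors.New("folio returned 5xx")

// pushOnce performs one tick. Nothing is buffered: after a failure the
// payload is dropped and the next tick scrapes fresh series.
func pushOnce(s *Stats, scrape scrapeFunc, send sendFunc, nowUnix float64) error {
	payload, n, err := scrape()
	if err != nil {
		s.PushErrorsTotal++ // assumption: scrape failures count as push errors
		return fmt.Errorf("scrape: %w", err)
	}
	if err := send(payload); err != nil {
		s.PushErrorsTotal++
		return fmt.Errorf("push: %w", err)
	}
	s.PushesTotal++
	s.LastPushTimestamp = nowUnix
	s.MetricsFederated = n
	return nil
}

// run calls pushOnce on every tick until ctx is cancelled.
func run(ctx context.Context, interval time.Duration, s *Stats, scrape scrapeFunc, send sendFunc) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-t.C:
			if err := pushOnce(s, scrape, send, float64(now.Unix())); err != nil {
				log.Printf("push failed: %v", err)
			}
		}
	}
}

func main() {
	scrape := func() ([]byte, int, error) { return []byte("payload"), 400, nil }
	calls := 0
	send := func([]byte) error {
		calls++
		if calls == 2 {
			return errFolioUnavailable // simulate one failed push
		}
		return nil
	}
	// Short interval and timeout so the demo finishes quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
	defer cancel()
	var s Stats
	run(ctx, 100*time.Millisecond, &s, scrape, send)
	fmt.Printf("%+v\n", s)
}
```

Separating pushOnce from the ticker loop lets the bookkeeping be exercised without waiting on real time.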

Bridge’s Own Metrics

perch_bridge_* lets the central system observe the courier itself:

  • perch_bridge_pushes_total — counter
  • perch_bridge_push_errors_total — counter
  • perch_bridge_push_duration_seconds — histogram, buckets [0.01, 0.05, 0.1, 0.5, 1, 2.5, 5, 10]
  • perch_bridge_last_push_timestamp_seconds — gauge; alerting target if it stops incrementing
  • perch_bridge_metrics_federated — gauge of how many series were in the last push

What does NOT pass through

The Bridge federates Prometheus metrics only. WebSocket events (nestr-007), structured logs, and traces have separate paths — Promtail/Loki for logs, Jaeger for traces (see nestr-010 for the trace-correlator).

Key Terms

  • Federation → polling another Prometheus-compatible source and re-emitting the series, optionally with relabeling.
  • Remote write → the protobuf-over-HTTP protocol Prometheus uses to push samples to a long-term store.
  • Snappy → the compression Prometheus remote write requires for the request body; the protocol uses Snappy's block format, not its framed (stream) format.

Q&A

Q: What single alert detects a stuck Bridge? A: time() - perch_bridge_last_push_timestamp_seconds > 90 — three missed 30 s ticks.
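
The arithmetic behind that threshold (3 × 30 s = 90 s) can be written out as a small Go check. The isStuck helper and its parameters are hypothetical and only restate the alert expression; in practice the condition would live in an alerting rule rather than in Go.

```go
package main

import "fmt"

// isStuck reports whether the last successful push is older than
// missedTicks push intervals. All times are in seconds.
func isStuck(nowUnix, lastPushUnix, intervalSeconds float64, missedTicks int) bool {
	return nowUnix-lastPushUnix > intervalSeconds*float64(missedTicks)
}

func main() {
	last := 1700000000.0 // arbitrary example timestamp
	fmt.Println(isStuck(last+60, last, 30, 3))  // 60 s since last push: false
	fmt.Println(isStuck(last+120, last, 30, 3)) // 120 s since last push: true
}
```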

Q: Does the Bridge re-send series on failure? A: No. A failed push only increments push_errors_total; the next tick re-scrapes fresh series. There is no buffer on disk.

Q: Why does the Bridge pull from /metrics and push, instead of Folio scraping the Engine directly? A: In the typical deployment Folio cannot reach the Engine’s network. The Bridge sits inside the Engine’s cluster and is the only component allowed egress to Folio.

Examples

A p99 push duration of about 1 s on a workspace with ~400 series is normal. A jump above 5 s on the histogram usually points at a slow remote-write endpoint rather than at the Engine, because Engine scrape duration is not recorded in perch_bridge_push_duration_seconds.
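
To make the p99 reading concrete, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. This is roughly the approach Prometheus's histogram_quantile takes for classic histograms. The estimateQuantile helper and the bucket counts in main are invented for illustration; only the bucket bounds come from the metric list above.

```go
package main

import "fmt"

// estimateQuantile interpolates quantile q from cumulative bucket counts.
// bounds holds the finite upper bounds in ascending order; cumulative has
// one extra trailing entry for the +Inf bucket (the total count).
func estimateQuantile(q float64, bounds []float64, cumulative []uint64) (float64, bool) {
	if q < 0 || q > 1 || len(bounds) == 0 || len(cumulative) != len(bounds)+1 {
		return 0, false
	}
	total := cumulative[len(cumulative)-1]
	if total == 0 {
		return 0, false // no observations yet
	}
	rank := q * float64(total)
	for i, upper := range bounds {
		if float64(cumulative[i]) < rank {
			continue
		}
		// The first bucket is assumed to start at 0.
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = bounds[i-1], float64(cumulative[i-1])
		}
		inBucket := float64(cumulative[i]) - below
		if inBucket == 0 {
			return lower, true
		}
		return lower + (upper-lower)*(rank-below)/inBucket, true
	}
	// The rank falls in the +Inf bucket: report the highest finite bound.
	return bounds[len(bounds)-1], true
}

func main() {
	bounds := []float64{0.01, 0.05, 0.1, 0.5, 1, 2.5, 5, 10}
	// Made-up cumulative counts for 100 pushes, last entry is +Inf.
	cumulative := []uint64{0, 0, 2, 40, 99, 100, 100, 100, 100}
	if p99, ok := estimateQuantile(0.99, bounds, cumulative); ok {
		// For these made-up counts the estimate lands at about 1 s.
		fmt.Printf("p99 is about %.2f s\n", p99)
	}
}
```

Because the estimate is interpolated, its resolution is limited by the bucket layout: a jump from the 1 s bucket into the 5 s or 10 s bucket is visible, while small shifts inside one bucket are not.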
