Perch Custom Services Roster
ELI5
Perch ships four little assistants on top of vanilla Prometheus: an accountant (cost-monitor), a coach with a clipboard (slo-tracker), a security guard at the deploy door (policy-enforcer), and a detective with a magnifying glass linking traces to logs (trace-correlator).
Technical Deep Dive
    flowchart TB
      P[Prometheus]
      L[Loki]
      J[Jaeger]
      GH[GitHub PR]
      SL[Slack / PagerDuty]
      subgraph Custom["Perch custom services (Go)"]
        CM[cost-monitor :8080]
        ST[slo-tracker :8080]
        PE[policy-enforcer :8080]
        TC[trace-correlator :8080]
      end
      P --> CM
      P --> ST
      P --> PE
      ST -->|error budget| PE
      GH -->|/webhook/github| PE
      PE --> SL
      L --> TC
      J --> TC
      P --> TC

cost-monitor (perch/cost-monitor/main.go)
- Polls Prometheus on `UpdateInterval` (default 5 m).
- Endpoints: `GET /api/v1/costs`, `GET /api/v1/capacity`, `GET /api/v1/recommendations`.
- Inputs: resource metrics (CPU, memory, storage), plus `CloudProvider` and `Region` for unit pricing.
- Outputs: per-service cost, capacity utilisation, rightsizing recommendations as JSON.
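The polling behaviour described above can be sketched as a ticker loop with a defaulted interval. This is a minimal illustration, not the code in `perch/cost-monitor/main.go`: the `Config` struct, `withDefaults`, and `pollEvery` are hypothetical names, and the provider and region values are placeholders. Only the three field names and the 5-minute default come from the roster.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Config is a hypothetical stand-in for cost-monitor's settings.
// Only the three field names are taken from the roster.
type Config struct {
	UpdateInterval time.Duration
	CloudProvider  string
	Region         string
}

// withDefaults applies the documented 5-minute default when no interval is set.
func (c Config) withDefaults() Config {
	if c.UpdateInterval <= 0 {
		c.UpdateInterval = 5 * time.Minute
	}
	return c
}

// pollEvery calls refresh once immediately and then on every tick until ctx
// is cancelled. It returns how many times refresh ran.
func pollEvery(ctx context.Context, interval time.Duration, refresh func()) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	calls := 0
	for {
		refresh()
		calls++
		select {
		case <-ctx.Done():
			return calls
		case <-ticker.C:
		}
	}
}

func main() {
	// Placeholder provider and region values, for illustration only.
	cfg := Config{CloudProvider: "aws", Region: "eu-west-1"}.withDefaults()
	fmt.Println("interval:", cfg.UpdateInterval)

	// A short interval and timeout keep the demo fast; a real poller would
	// pass cfg.UpdateInterval and a long-lived context.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	n := pollEvery(ctx, 10*time.Millisecond, func() { /* query Prometheus here */ })
	fmt.Println("refreshes:", n)
}
```

Refreshing once before the first tick means the JSON endpoints have data to serve right after startup instead of one full interval later.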
slo-tracker (perch/slo-tracker/main.go)
- Update interval default 60 s.
- Endpoints: `GET /api/v1/slo/status?service=X`, `GET /api/v1/error-budget?service=X&window=30d`.
- Re-emits its own SLO metrics into Prometheus so dashboards and alerts can subscribe; downstream of cost-monitor for spend-per-9 calculations.
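As a sketch of how a caller might build these two queries, the helper below assembles the URLs with `net/url` so that service names are escaped correctly. The base address `http://slo-tracker:8080` is an assumption (the roster gives only the port and the paths), and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// sloTrackerBase is a hypothetical in-cluster address; only the port (:8080)
// and the paths below come from the roster.
const sloTrackerBase = "http://slo-tracker:8080"

// buildURL joins a base address, a path, and query parameters.
func buildURL(base, path string, params map[string]string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = path
	q := url.Values{}
	for k, v := range params {
		q.Set(k, v)
	}
	u.RawQuery = q.Encode() // Encode sorts keys, so the output is deterministic.
	return u.String(), nil
}

// sloStatusURL targets GET /api/v1/slo/status?service=X.
func sloStatusURL(service string) (string, error) {
	return buildURL(sloTrackerBase, "/api/v1/slo/status", map[string]string{"service": service})
}

// errorBudgetURL targets GET /api/v1/error-budget?service=X&window=30d.
func errorBudgetURL(service, window string) (string, error) {
	return buildURL(sloTrackerBase, "/api/v1/error-budget", map[string]string{
		"service": service,
		"window":  window,
	})
}

func main() {
	status, err := sloStatusURL("api")
	if err != nil {
		panic(err)
	}
	budget, err := errorBudgetURL("api", "30d")
	if err != nil {
		panic(err)
	}
	fmt.Println(status) // http://slo-tracker:8080/api/v1/slo/status?service=api
	fmt.Println(budget) // http://slo-tracker:8080/api/v1/error-budget?service=api&window=30d
}
```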
policy-enforcer (perch/policy-enforcer/main.go)
- Three modes: `audit`, `warn`, `enforce`.
- Endpoints: `GET /api/v1/policy/check?service=X`, `GET /api/v1/policy/status`, `POST /webhook/github`.
- The GitHub webhook reads the slo-tracker’s error budget for the touched service and either lets the merge through, posts a warning, or blocks (depending on mode).
- Notifies Slack on enforcement actions.
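The mode-dependent gating can be sketched as a small pure function. Several details are assumptions rather than facts from the source: the budget counts as exhausted at 100 % consumption, `audit` only records the result, and an unreachable slo-tracker is treated with "warn" semantics. All type and function names are hypothetical.

```go
package main

import "fmt"

// Mode mirrors the three policy-enforcer modes named in the roster.
type Mode string

const (
	Audit   Mode = "audit"
	Warn    Mode = "warn"
	Enforce Mode = "enforce"
)

// Action is what the webhook handler does with the pull request.
type Action string

const (
	Allow            Action = "allow"
	AllowWithWarning Action = "allow-with-warning"
	Block            Action = "block"
)

// decide maps a mode and an error-budget reading to an action.
// budgetConsumed is the fraction of the budget spent (1.0 = fully spent).
// budgetKnown is false when slo-tracker could not be reached.
func decide(mode Mode, budgetConsumed float64, budgetKnown bool) Action {
	if !budgetKnown {
		// Exhaustion cannot be proven, so do not block; warn instead.
		return AllowWithWarning
	}
	if budgetConsumed < 1.0 { // assumed threshold: 100 % of the budget
		return Allow
	}
	switch mode {
	case Enforce:
		return Block
	case Warn:
		return AllowWithWarning
	default:
		// audit (assumption): record the violation, take no visible action.
		return Allow
	}
}

func main() {
	fmt.Println(decide(Enforce, 1.10, true)) // block
	fmt.Println(decide(Warn, 1.10, true))    // allow-with-warning
	fmt.Println(decide(Audit, 1.10, true))   // allow
	fmt.Println(decide(Enforce, 0, false))   // allow-with-warning
}
```

Keeping the decision free of I/O makes the gating rule easy to unit-test separately from the GitHub and Slack calls.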
trace-correlator (perch/trace-correlator/main.go)
- Bridges Jaeger ↔ Loki ↔ Prometheus by trace ID and request ID.
- Endpoints: `GET /api/v1/trace/logs?trace_id=X&service=Y`, `GET /api/v1/logs/trace?request_id=X`, `GET /api/v1/correlate?id=X&start_time=…&end_time=…`.
- Default time window: last hour when not specified; important to remember when correlating an old incident.
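To illustrate why the default window matters, the sketch below checks whether an incident falls inside the last hour and, if not, builds a `/api/v1/correlate` query with an explicit window. The RFC 3339 timestamp format, the base address, the `abc123` ID, and the 30-minute padding are all assumptions; the roster does not say which time format the service expects.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// defaultWindow is the documented fallback when no window is specified.
const defaultWindow = time.Hour

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// coveredByDefault reports whether event lies within the last hour before now.
func coveredByDefault(event, now time.Time) bool {
	return !event.After(now) && now.Sub(event) <= defaultWindow
}

// correlateURL builds a correlate query with an explicit window.
// The RFC 3339 encoding of start_time and end_time is an assumption.
func correlateURL(base, id string, start, end time.Time) string {
	q := url.Values{}
	q.Set("id", id)
	q.Set("start_time", start.UTC().Format(time.RFC3339))
	q.Set("end_time", end.UTC().Format(time.RFC3339))
	return base + "/api/v1/correlate?" + q.Encode()
}

func main() {
	now := mustParse("2024-03-01T11:00:00Z")
	incident := mustParse("2024-02-27T09:15:00Z")

	if !coveredByDefault(incident, now) {
		// The incident is older than one hour, so pass the window explicitly.
		start := incident.Add(-30 * time.Minute)
		end := incident.Add(30 * time.Minute)
		fmt.Println(correlateURL("http://trace-correlator:8080", "abc123", start, end))
	}
}
```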
Operator Decision Table
| Question | Service to call |
|---|---|
| “How much does service X cost this month?” | cost-monitor /api/v1/costs |
| “Is service X within its SLO?” | slo-tracker /api/v1/slo/status |
| “Can I merge this PR safely?” | policy-enforcer /api/v1/policy/check |
| “Show me logs for this trace ID” | trace-correlator /api/v1/trace/logs |
Key Terms
- Error budget → `1 - SLO` minus consumed unreliability over a window; policy-enforcer gates on this.
- Rightsizing → cost-monitor’s recommendation to scale resources up or down based on observed utilisation.
- Audit / warn / enforce → policy-enforcer modes; only `enforce` blocks merges, while the other two are observe-only.
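The error-budget definition can be made concrete with a short calculation. This sketch assumes a request-based SLO (failed requests over total requests); how slo-tracker actually measures unreliability is not stated in the roster, and the struct and function names are hypothetical.

```go
package main

import "fmt"

// budgetReport holds the quantities from the error-budget definition.
type budgetReport struct {
	Allowed   float64 // 1 - SLO: the unreliability the SLO permits
	Observed  float64 // failed / total: the unreliability consumed
	Remaining float64 // Allowed - Observed: negative once the budget is overspent
	Consumed  float64 // Observed / Allowed: 1.0 means the budget is fully spent
}

// errorBudget computes the report for a request-based SLO over one window.
func errorBudget(slo float64, failed, total int) budgetReport {
	allowed := 1 - slo
	observed := 0.0
	if total > 0 {
		observed = float64(failed) / float64(total)
	}
	r := budgetReport{Allowed: allowed, Observed: observed, Remaining: allowed - observed}
	if allowed > 0 {
		r.Consumed = observed / allowed
	}
	return r
}

func main() {
	// A 99.9 % SLO over 1,000,000 requests permits 1,000 failures.
	// 1,100 failures therefore overspend the budget by a tenth.
	r := errorBudget(0.999, 1100, 1000000)
	fmt.Printf("consumed: %.0f%%\n", r.Consumed*100) // consumed: 110%
	fmt.Printf("remaining: %.4f\n", r.Remaining)     // remaining: -0.0001
}
```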
Q&A
Q: What happens if the slo-tracker is down when policy-enforcer is asked? A: The webhook handler degrades to “warn” semantics — it cannot prove the budget is exhausted, so it does not block, but it logs and posts a Slack warning.
Q: Does cost-monitor write back to Prometheus? A: Recommendations are exposed only over its own JSON API; cost time-series themselves are computed on the fly from PromQL queries.
Q: Why split slo-tracker and policy-enforcer instead of one service? A: SLO computation is read-mostly and cacheable; policy enforcement carries a write surface (webhooks, Slack, GitHub status). Separating them keeps blast radius small when the gating logic changes.
Examples
A typical “block merge” flow: GitHub posts a `pull_request` event → policy-enforcer queries slo-tracker for `service=api`, `window=30d` → budget consumption is 110 % → in `enforce` mode, policy-enforcer returns a failed status check on the PR and posts a Slack message naming the offending service.