CRUMB a card from devarno-cloud

Perch Custom Services Roster

nestr intermediate 6 min read

ELI5

Perch ships four little assistants on top of vanilla Prometheus: an accountant (cost-monitor), a coach with a clipboard (slo-tracker), a security guard at the deploy door (policy-enforcer), and a detective with a magnifying glass linking traces to logs (trace-correlator).

Technical Deep Dive

flowchart TB
    P[Prometheus]
    L[Loki]
    J[Jaeger]
    GH[GitHub PR]
    SL[Slack / PagerDuty]
    subgraph Custom["Perch custom services (Go)"]
        CM[cost-monitor :8080]
        ST[slo-tracker :8080]
        PE[policy-enforcer :8080]
        TC[trace-correlator :8080]
    end
    P --> CM
    P --> ST
    P --> PE
    ST -->|error budget| PE
    GH -->|/webhook/github| PE
    PE --> SL
    L --> TC
    J --> TC
    P --> TC

cost-monitor (perch/cost-monitor/main.go)

  • Polls Prometheus on UpdateInterval (default 5 m).
  • Endpoints: GET /api/v1/costs, GET /api/v1/capacity, GET /api/v1/recommendations.
  • Inputs: resource metrics (CPU, memory, storage), plus CloudProvider and Region to look up unit pricing.
  • Outputs: per-service cost, capacity utilisation, rightsizing recommendations as JSON.
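The rightsizing output above can be pictured as a simple threshold rule on observed utilisation. The following is a minimal sketch only: the Recommendation type, the Rightsize function, and the 30 % / 80 % thresholds are assumptions made for illustration and are not taken from perch/cost-monitor/main.go.

```go
package main

import "fmt"

// Recommendation is a hypothetical shape for one rightsizing suggestion.
type Recommendation struct {
	Service string `json:"service"`
	Action  string `json:"action"` // "scale_down", "scale_up", or "keep"
}

// Assumed thresholds for illustration; the real service may use others.
const (
	lowUtilisation  = 0.30
	highUtilisation = 0.80
)

// Rightsize maps observed utilisation (used / requested) to an action.
func Rightsize(service string, utilisation float64) Recommendation {
	action := "keep"
	switch {
	case utilisation < lowUtilisation:
		action = "scale_down"
	case utilisation > highUtilisation:
		action = "scale_up"
	}
	return Recommendation{Service: service, Action: action}
}

func main() {
	for _, u := range []float64{0.12, 0.55, 0.93} {
		fmt.Printf("%.2f -> %s\n", u, Rightsize("api", u).Action)
	}
}
```

A rule like this is cheap to evaluate on every poll, which fits a service that recomputes from Prometheus on a fixed interval.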

slo-tracker (perch/slo-tracker/main.go)

  • Recomputes SLO status on its update interval (default 60 s).
  • Endpoints: GET /api/v1/slo/status?service=X, GET /api/v1/error-budget?service=X&window=30d.
  • Re-emits its own SLO metrics into Prometheus so dashboards and alerts can subscribe; downstream of cost-monitor for spend-per-9 calculations.
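To make the error-budget figure concrete, here is a small sketch of the usual arithmetic: the budget is the allowed failure ratio (1 - SLO), and consumption is the observed failure ratio divided by that allowance. The function name BudgetConsumed and the use of raw request counts are assumptions; slo-tracker may compute this differently, for example directly from PromQL.

```go
package main

import (
	"errors"
	"fmt"
)

// BudgetConsumed returns the fraction of the error budget used over a window.
// slo is the target as a fraction (0.999 for 99.9 %); failed and total are
// request counts over the same window. A result of 1.0 means the budget is
// exactly exhausted; values above 1.0 mean it is overspent.
func BudgetConsumed(slo float64, failed, total uint64) (float64, error) {
	if slo <= 0 || slo >= 1 {
		return 0, errors.New("slo must be strictly between 0 and 1")
	}
	if total == 0 {
		return 0, nil // no traffic in the window, nothing consumed
	}
	allowed := 1 - slo // the error budget as a failure ratio
	observed := float64(failed) / float64(total)
	return observed / allowed, nil
}

func main() {
	c, err := BudgetConsumed(0.999, 1100, 1_000_000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("budget consumed: %.0f%%\n", c*100)
}
```

With a 99.9 % target, 1,100 failures in 1,000,000 requests is roughly 110 % of the budget, which is the overspent case a gate would act on.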

policy-enforcer (perch/policy-enforcer/main.go)

  • Three modes: audit, warn, enforce.
  • Endpoints: GET /api/v1/policy/check?service=X, GET /api/v1/policy/status, POST /webhook/github.
  • The GitHub webhook reads the slo-tracker’s error budget for the touched service and either lets the merge through, posts a warning, or blocks (depending on mode).
  • Notifies Slack on enforcement actions.
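The three modes can be read as a small decision function over the error budget. The sketch below is an assumption-laden illustration: the Mode and Action types, the Decide function, and the rule that the gate triggers once consumption reaches 100 % are all hypothetical and not confirmed by perch/policy-enforcer/main.go.

```go
package main

import "fmt"

// Mode mirrors the three policy-enforcer modes named above.
type Mode string

const (
	ModeAudit   Mode = "audit"
	ModeWarn    Mode = "warn"
	ModeEnforce Mode = "enforce"
)

// Action is a hypothetical outcome of evaluating one pull request.
type Action string

const (
	Allow Action = "allow"
	Warn  Action = "warn"
	Block Action = "block"
)

// Decide maps a mode and an error-budget consumption ratio to an action.
// Assumption: the budget counts as exhausted at a ratio of 1.0 or more.
func Decide(mode Mode, budgetConsumed float64) Action {
	if budgetConsumed < 1.0 {
		return Allow // budget remains, nothing to gate on
	}
	switch mode {
	case ModeEnforce:
		return Block
	case ModeWarn:
		return Warn
	default:
		return Allow // audit: the merge goes through, the violation is only recorded
	}
}

func main() {
	for _, m := range []Mode{ModeAudit, ModeWarn, ModeEnforce} {
		fmt.Printf("%-8s -> %s\n", m, Decide(m, 1.10))
	}
}
```

Keeping the decision pure (no HTTP, no Slack calls) makes the gating rule easy to unit-test separately from the webhook plumbing.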

trace-correlator (perch/trace-correlator/main.go)

  • Bridges Jaeger ↔ Loki ↔ Prometheus by trace ID and request ID.
  • Endpoints: GET /api/v1/trace/logs?trace_id=X&service=Y, GET /api/v1/logs/trace?request_id=X, GET /api/v1/correlate?id=X&start_time=…&end_time=….
  • Default time window: the last hour when none is specified. Keep this in mind when correlating an older incident, which needs an explicit window.

Operator Decision Table

| Question | Service to call |
| --- | --- |
| “How much does service X cost this month?” | cost-monitor /api/v1/costs |
| “Is service X within its SLO?” | slo-tracker /api/v1/slo/status |
| “Can I merge this PR safely?” | policy-enforcer /api/v1/policy/check |
| “Show me logs for this trace ID” | trace-correlator /api/v1/trace/logs |

Key Terms

  • Error budget → the allowed unreliability (1 - SLO) minus the unreliability already consumed over a window; policy-enforcer gates on this.
  • Rightsizing → cost-monitor’s recommendation to scale resources up or down based on observed utilisation.
  • Audit / warn / enforce → policy-enforcer modes; only enforce blocks merges. Audit and warn never block (warn posts a warning instead).

Q&A

Q: What happens if the slo-tracker is down when policy-enforcer is asked? A: The webhook handler degrades to “warn” semantics — it cannot prove the budget is exhausted, so it does not block, but it logs and posts a Slack warning.

Q: Does cost-monitor write back to Prometheus? A: No. Recommendations are exposed only over its own JSON API, and the cost time-series themselves are computed on the fly from PromQL queries.

Q: Why split slo-tracker and policy-enforcer instead of one service? A: SLO computation is read-mostly and cacheable; policy enforcement carries a write surface (webhooks, Slack, GitHub status). Separating them keeps blast radius small when the gating logic changes.

Examples

A typical “block merge” flow: GitHub posts a pull_request event → policy-enforcer queries slo-tracker's error budget with service=api and window=30d → budget consumption is 110 % → in enforce mode, policy-enforcer returns a failed status check on the PR and posts a Slack message naming the offending service.
