Perch Custom Services Roster

nestr intermediate 6 min read

ELI5

Perch ships four little assistants on top of vanilla Prometheus: an accountant (cost-monitor), a coach with a clipboard (slo-tracker), a security guard at the deploy door (policy-enforcer), and a detective with a magnifying glass linking traces to logs (trace-correlator).

Technical Deep Dive

flowchart TB
    P[Prometheus]
    L[Loki]
    J[Jaeger]
    GH[GitHub PR]
    SL[Slack / PagerDuty]

    subgraph Custom["Perch custom services (Go)"]
        CM[cost-monitor :8080]
        ST[slo-tracker :8080]
        PE[policy-enforcer :8080]
        TC[trace-correlator :8080]
    end

    P --> CM
    P --> ST
    P --> PE
    ST -->|error budget| PE
    GH -->|/webhook/github| PE
    PE --> SL
    L --> TC
    J --> TC
    P --> TC

cost-monitor (`perch/cost-monitor/main.go`)

Polls Prometheus once per UpdateInterval (default 5 min).
Endpoints: GET /api/v1/costs, GET /api/v1/capacity, GET /api/v1/recommendations.
Inputs: resource metrics (CPU, memory, storage), plus CloudProvider and Region, which select unit pricing.
Outputs: per-service cost, capacity utilisation, rightsizing recommendations as JSON.
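
The polling behaviour described above can be sketched as a ticker loop. This is a minimal illustration, not the code in `perch/cost-monitor/main.go`: the names `pollLoop`, `demoPolls`, and `defaultUpdateInterval` are hypothetical, and polling once immediately and retrying a failed poll on the next tick are both assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// defaultUpdateInterval mirrors the documented 5-minute default.
const defaultUpdateInterval = 5 * time.Minute

// pollLoop runs poll once right away and then on every tick until ctx ends.
// It returns how many times poll was called.
func pollLoop(ctx context.Context, interval time.Duration, poll func(context.Context) error) int {
	if interval <= 0 {
		interval = defaultUpdateInterval
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	calls := 0
	for {
		// Assumption: a failed poll is logged and retried on the next tick.
		if err := poll(ctx); err != nil {
			fmt.Println("poll failed:", err)
		}
		calls++
		select {
		case <-ctx.Done():
			return calls
		case <-ticker.C:
		}
	}
}

// demoPolls runs pollLoop with a no-op poll for totalMs milliseconds.
func demoPolls(totalMs, intervalMs int) int {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(totalMs)*time.Millisecond)
	defer cancel()
	return pollLoop(ctx, time.Duration(intervalMs)*time.Millisecond, func(context.Context) error {
		// A real poll would query Prometheus for CPU, memory, and storage metrics.
		return nil
	})
}

func main() {
	fmt.Println("polls in 50 ms at a 10 ms interval:", demoPolls(50, 10))
}
```

Passing the context into `poll` lets a shutdown cancel an in-flight Prometheus query instead of waiting for the next tick.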

slo-tracker (`perch/slo-tracker/main.go`)

Recomputes SLO state on its update interval (default 60 s).
Endpoints: GET /api/v1/slo/status?service=X, GET /api/v1/error-budget?service=X&window=30d.
Re-emits its own SLO metrics into Prometheus so dashboards and alerts can query them; sits downstream of cost-monitor for spend-per-9 calculations.
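
A caller might build the two slo-tracker query URLs as follows. This is a sketch under assumptions: the host name `slo-tracker` and the helper `sloQueryURL` are hypothetical, and only the paths and the `service` and `window` parameters come from the endpoint list above.

```go
package main

import (
	"fmt"
	"net/url"
)

// sloQueryURL builds a slo-tracker GET URL with escaped query parameters.
// An empty window is omitted, since only /api/v1/error-budget takes one.
func sloQueryURL(baseURL, path, service, window string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u.Path = path
	q := url.Values{}
	q.Set("service", service)
	if window != "" {
		q.Set("window", window)
	}
	u.RawQuery = q.Encode() // Encode sorts keys, so the output is deterministic.
	return u.String(), nil
}

func main() {
	base := "http://slo-tracker:8080" // hypothetical host name

	status, err := sloQueryURL(base, "/api/v1/slo/status", "api", "")
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	budget, err := sloQueryURL(base, "/api/v1/error-budget", "api", "30d")
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	fmt.Println(status)
	fmt.Println(budget)
}
```

Using `url.Values` rather than string concatenation keeps service names with spaces or reserved characters from corrupting the query string.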

policy-enforcer (`perch/policy-enforcer/main.go`)

Three modes: audit, warn, enforce.
Endpoints: GET /api/v1/policy/check?service=X, GET /api/v1/policy/status, POST /webhook/github.
The GitHub webhook reads the slo-tracker’s error budget for the touched service and either lets the merge through, posts a warning, or blocks (depending on mode).
Notifies Slack on enforcement actions.
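
The mode logic above can be sketched as a small pure function. The types `Mode` and `Action` and the function `decide` are hypothetical names. The mapping of audit to "let through" and the treatment of an unknown mode as audit are assumptions; the source only says the outcome depends on the mode and that only enforce blocks.

```go
package main

import "fmt"

// Mode is the configured policy-enforcer mode.
type Mode string

const (
	ModeAudit   Mode = "audit"
	ModeWarn    Mode = "warn"
	ModeEnforce Mode = "enforce"
)

// Action is what the webhook does with the pull request.
type Action string

const (
	ActionAllow Action = "allow"
	ActionWarn  Action = "warn"
	ActionBlock Action = "block"
)

// decide maps the mode and the budget state to an action.
// Only enforce mode ever returns ActionBlock.
func decide(mode Mode, budgetExhausted bool) Action {
	if !budgetExhausted {
		return ActionAllow
	}
	switch mode {
	case ModeEnforce:
		return ActionBlock
	case ModeWarn:
		return ActionWarn
	default:
		// Assumption: audit (and any unknown mode) records the violation only.
		return ActionAllow
	}
}

func main() {
	for _, m := range []Mode{ModeAudit, ModeWarn, ModeEnforce} {
		fmt.Printf("%s with exhausted budget: %s\n", m, decide(m, true))
	}
}
```

Keeping the decision free of I/O makes it straightforward to unit-test separately from the GitHub and Slack calls.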

trace-correlator (`perch/trace-correlator/main.go`)

Bridges Jaeger ↔ Loki ↔ Prometheus by trace ID and request ID.
Endpoints: GET /api/v1/trace/logs?trace_id=X&service=Y, GET /api/v1/logs/trace?request_id=X, GET /api/v1/correlate?id=X&start_time=…&end_time=….
Default time window: the last hour when start_time and end_time are not specified. Pass both explicitly when correlating an older incident; otherwise the search only covers the last hour.
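
One way such a default could be resolved is sketched below. The function `resolveWindow` is hypothetical, the RFC 3339 timestamp format is an assumption (the source elides the parameter values), and so is the handling of a request that supplies only one bound.

```go
package main

import (
	"fmt"
	"time"
)

// now is a variable so the clock can be pinned in tests.
var now = time.Now

// resolveWindow parses optional RFC 3339 bounds (format is an assumption).
// A missing end defaults to the current time; a missing start defaults to
// one hour before the end, which yields "last hour" when both are empty.
func resolveWindow(startRaw, endRaw string) (start, end time.Time, err error) {
	end = now()
	if endRaw != "" {
		if end, err = time.Parse(time.RFC3339, endRaw); err != nil {
			return time.Time{}, time.Time{}, fmt.Errorf("end_time: %w", err)
		}
	}
	start = end.Add(-time.Hour)
	if startRaw != "" {
		if start, err = time.Parse(time.RFC3339, startRaw); err != nil {
			return time.Time{}, time.Time{}, fmt.Errorf("start_time: %w", err)
		}
	}
	if !start.Before(end) {
		return time.Time{}, time.Time{}, fmt.Errorf("start_time must be before end_time")
	}
	return start, end, nil
}

func main() {
	// No bounds given: the window is the hour ending now.
	s, e, err := resolveWindow("", "")
	fmt.Println(e.Sub(s), err)

	// An old incident needs explicit bounds (timestamps are made up).
	s, e, err = resolveWindow("2024-03-01T10:00:00Z", "2024-03-01T16:00:00Z")
	fmt.Println(s, e, err)
}
```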

Operator Decision Table

Question	Service to call
“How much does service X cost this month?”	cost-monitor `/api/v1/costs`
“Is service X within its SLO?”	slo-tracker `/api/v1/slo/status`
“Can I merge this PR safely?”	policy-enforcer `/api/v1/policy/check`
“Show me logs for this trace ID”	trace-correlator `/api/v1/trace/logs`

Key Terms

Error budget → the unreliability an SLO allows over a window (1 − SLO) minus the unreliability already consumed in that window; policy-enforcer gates on this.
Rightsizing → cost-monitor’s recommendation to scale resources up or down based on observed utilisation.
Audit / warn / enforce → policy-enforcer modes; only enforce blocks merges, while audit and warn are observe-only.
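
The error-budget definition can be made concrete with a short calculation. This sketch assumes a ratio-based SLO (observed error ratio compared with the allowed ratio); the source does not say how slo-tracker measures unreliability, and all function names and sample numbers here are hypothetical.

```go
package main

import "fmt"

// allowedErrorRatio returns 1 - SLO, the share of failures the SLO permits.
func allowedErrorRatio(slo float64) (float64, error) {
	if slo <= 0 || slo >= 1 {
		return 0, fmt.Errorf("slo must be strictly between 0 and 1, got %v", slo)
	}
	return 1 - slo, nil
}

// budgetConsumed returns the fraction of the budget used over the window.
// A value above 1 means the budget is overspent.
func budgetConsumed(slo, observedErrorRatio float64) (float64, error) {
	allowed, err := allowedErrorRatio(slo)
	if err != nil {
		return 0, err
	}
	return observedErrorRatio / allowed, nil
}

// budgetRemaining returns the unspent fraction; it goes negative when overspent.
func budgetRemaining(slo, observedErrorRatio float64) (float64, error) {
	consumed, err := budgetConsumed(slo, observedErrorRatio)
	if err != nil {
		return 0, err
	}
	return 1 - consumed, nil
}

func main() {
	// A 99.9 % SLO allows a 0.1 % error ratio; 0.11 % observed overspends it.
	consumed, err := budgetConsumed(0.999, 0.0011)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("consumed: %.0f%% of budget\n", consumed*100)
}
```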

Q&A

Q: What happens if the slo-tracker is down when policy-enforcer is asked? A: The webhook handler degrades to “warn” semantics: it cannot prove the budget is exhausted, so it does not block, but it logs the failure and posts a Slack warning.
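
The fallback in this answer might look like the sketch below. The names `gate`, `budgetLookup`, `trackerDown`, and `fixedBudget` are hypothetical. Applying the fallback in every mode and treating a consumed fraction of 1 or more as "exhausted" are assumptions the source does not confirm.

```go
package main

import (
	"errors"
	"fmt"
)

var errTrackerUnavailable = errors.New("slo-tracker unavailable")

// budgetLookup returns the consumed fraction of a service's error budget.
type budgetLookup func(service string) (consumed float64, err error)

// gate returns the action for a pull request and a human-readable reason.
func gate(mode, service string, lookup budgetLookup) (action, reason string) {
	consumed, err := lookup(service)
	if err != nil {
		// Exhaustion cannot be proven, so warn instead of blocking.
		return "warn", fmt.Sprintf("error budget for %s unknown: %v", service, err)
	}
	if consumed < 1 {
		return "allow", "error budget not exhausted"
	}
	reason = fmt.Sprintf("%s has consumed %.0f%% of its error budget", service, consumed*100)
	switch mode {
	case "enforce":
		return "block", reason
	case "warn":
		return "warn", reason
	default:
		return "allow", reason // audit: record only (assumption)
	}
}

// trackerDown simulates an unreachable slo-tracker.
func trackerDown(string) (float64, error) { return 0, errTrackerUnavailable }

// fixedBudget simulates a healthy slo-tracker returning a fixed value.
func fixedBudget(consumed float64) budgetLookup {
	return func(string) (float64, error) { return consumed, nil }
}

func main() {
	fmt.Println(gate("enforce", "api", trackerDown))
	fmt.Println(gate("enforce", "api", fixedBudget(1.1)))
}
```

Failing open here trades strictness for availability: an slo-tracker outage cannot freeze every merge, but it also cannot silently pass unnoticed because the warning is still posted.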

Q: Does cost-monitor write back to Prometheus? A: Recommendations are exposed only over its own JSON API; cost time-series themselves are computed on the fly from PromQL queries.

Q: Why split slo-tracker and policy-enforcer instead of one service? A: SLO computation is read-mostly and cacheable; policy enforcement carries a write surface (webhooks, Slack, GitHub status). Separating them keeps blast radius small when the gating logic changes.

Examples

A typical “block merge” flow: GitHub posts a pull_request event → policy-enforcer queries slo-tracker with service=api&window=30d → budget consumption is 110 % → in enforce mode, policy-enforcer returns a failed status check on the PR and posts a Slack message naming the offending service.

