Storm Observability Pillars

sparki intermediate 6 min read

ELI5

Storm is the platform’s nervous system: it counts what happened (metrics), narrates what happened (logs), and draws maps of how requests moved (traces). It picks signals surgically — every metric, log, and trace must answer a specific operational question.

Technical Deep Dive

services/observability-storm/STORM_OBSERVABILITY_STRATEGY.md is the canonical document. The stack is Prometheus + StatsD for metrics, structured JSON logs, and distributed tracing across api-engine, deploy-loco, and auth-shield.

Four Instrumentation Surfaces

Surface	Examples
Application Layer	HTTP latency p50/p95/p99, error rates, status code distribution, goroutines, heap, GC pauses
Pipeline & Build	Stage timings (detect → generate → execute → deploy), job queue size, executor container startup, test pass/fail
Infrastructure	CPU, memory, disk I/O, container memory pressure, K8s pod restarts, DB connection saturation, replication lag
Business & SLO	Error-budget consumed, fast/slow burn rate, availability of detect→build→deploy, MTTR
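
As a rough illustration of the application-layer latency figures, the sketch below computes p50/p95/p99 from raw latency samples using the nearest-rank method. The function names are hypothetical and do not come from the strategy document; in a Prometheus setup these percentiles would more typically be estimated from histogram buckets at query time.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding.
    rank = -(-pct * len(ordered) // 100)
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return the three latency percentiles named in the table."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# Five request latencies in milliseconds (made-up values).
summary = latency_summary([120, 80, 95, 400, 110])
```

With only five samples, p95 and p99 both land on the slowest request, which is one reason tail percentiles are only meaningful over a reasonably large window.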

Trace Boundaries

Operation	Start	End	Span Tags
Project Detection	API call received	Detection result stored	`operation=detect, project_id, language_count, time_ms`
Pipeline Generation	Detection complete	Pipeline config persisted	`operation=generate, framework_count, stage_count`
Build Execution	Job dequeued	Build log archived	`operation=build, job_id, executor_type, status`
Deployment	Deployment triggered	Target reached	`operation=deploy, target_env, strategy, duration_ms`
Chaos Test	Test injected	Chaos resolved	`operation=chaos, test_type, failure_mode, recovery_time`
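
One way to make the "mandatory tags" part of a trace boundary concrete is to reject a span that closes without them. The sketch below is a minimal, standard-library-only illustration: the tag sets are copied from the table above, while `span`, `REQUIRED_TAGS`, and `FINISHED_SPANS` are hypothetical names, not Storm's actual API.

```python
import time
import uuid
from contextlib import contextmanager

# Mandatory tags per operation, copied from the Trace Boundaries table.
REQUIRED_TAGS = {
    "detect": {"project_id", "language_count", "time_ms"},
    "generate": {"framework_count", "stage_count"},
    "build": {"job_id", "executor_type", "status"},
    "deploy": {"target_env", "strategy", "duration_ms"},
    "chaos": {"test_type", "failure_mode", "recovery_time"},
}

FINISHED_SPANS = []  # stand-in for a real exporter (assumption)


@contextmanager
def span(operation, trace_id=None):
    """Open a span at the boundary start and validate its tags at the end."""
    if operation not in REQUIRED_TAGS:
        raise ValueError(f"unknown operation: {operation!r}")
    tags = {"operation": operation}
    started = time.monotonic()
    yield tags  # the caller fills in tags while the work runs
    # Reached only if the body finished without raising.
    missing = REQUIRED_TAGS[operation] - set(tags)
    if missing:
        raise ValueError(f"span {operation!r} missing tags: {sorted(missing)}")
    FINISHED_SPANS.append({
        "trace_id": trace_id or uuid.uuid4().hex,
        "elapsed_s": time.monotonic() - started,
        "tags": tags,
    })


with span("deploy") as tags:
    tags.update(target_env="prod", strategy="canary", duration_ms=90_000)
```

Validating at span close rather than at span open lets tags such as `status` or `duration_ms` be set once the outcome is known.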

Signal Flow

flowchart LR
  subgraph Services
    AE[api-engine]
    DL[deploy-loco]
    AS[auth-shield]
  end
  AE -- Prometheus scrape --> PR[(Prometheus)]
  DL -- Prometheus scrape --> PR
  AS -- Prometheus scrape --> PR
  AE -- OTLP traces --> OT[(OTel Collector)]
  DL -- OTLP traces --> OT
  AS -- OTLP traces --> OT
  OT --> JG[(Jaeger/Tempo)]
  AE -- JSON logs --> LK[(Loki)]
  DL -- JSON logs --> LK
  AS -- JSON logs --> LK
  PR --> GR[Grafana dashboards]
  JG --> GR
  LK --> GR
  PR --> AM[Alertmanager]
  AM --> ON[on-call]
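
For logs and traces to line up in Grafana, each JSON log line usually needs to carry the trace ID of the request it belongs to. The sketch below shows one way to emit such a line with Python's standard `logging` module. The field names (`service`, `trace_id`) and the `build_logger` helper are assumptions for illustration, not the schema defined by the strategy document.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            # Both fields arrive through the `extra` argument of the log call.
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })


def build_logger(name, stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()  # in a service this would be stdout
log = build_logger("deploy-loco", stream)
log.info(
    "canary promoted",
    extra={"service": "deploy-loco", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
)
```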

Error Budget Vocabulary

budget consumed — fraction of the SLO window’s allowed errors used so far.
fast burn — short-window high-rate error consumption (page now).
slow burn — long-window slow drift toward exhausting the budget (ticket).
MTTR — mean time to recovery, tracked as a business metric.
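
The fast/slow distinction can be sketched as a two-window check on burn rate, where a burn rate of 1 means the budget would be exactly exhausted at the end of the SLO window. The thresholds below are illustrative assumptions (14.4 is a commonly cited paging threshold for a 30-day window), not values taken from the strategy document, and the function names are hypothetical.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'sustainable' the budget is being spent.

    error_ratio: observed fraction of bad events in the window (0..1).
    slo: target success ratio, e.g. 0.999.
    """
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("slo must be below 1.0")
    return error_ratio / budget


def classify(short_window_ratio, long_window_ratio, slo,
             fast_threshold=14.4, slow_threshold=1.0):
    """Map the two windows onto the vocabulary above: page, ticket, or ok."""
    if burn_rate(short_window_ratio, slo) >= fast_threshold:
        return "page"    # fast burn: short window, high rate
    if burn_rate(long_window_ratio, slo) >= slow_threshold:
        return "ticket"  # slow burn: long window, steady drift
    return "ok"


# 2% errors in the short window against a 99.9% SLO is roughly a 20x burn.
decision = classify(short_window_ratio=0.02, long_window_ratio=0.002, slo=0.999)
```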

Key Terms

instrumentation surface — one of the four bands above; the strategy explicitly avoids “everything is a metric”
trace boundary — a defined start/end pair for a span, with mandatory tags
error budget — 1 - SLO; the allowed bad-event quota in a rolling window
burn rate — speed at which the budget is being consumed; multi-window alerting (fast & slow)
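
A small worked example may help tie the error budget definition to numbers. The sketch below uses exact fractions to compute the bad-event quota for a window and the share of it already spent. The SLO and event counts are made-up inputs, and the function names are hypothetical.

```python
from fractions import Fraction


def error_budget(slo):
    """Error budget as a fraction: 1 - SLO. Pass the SLO as a string such
    as "0.999" so the arithmetic stays exact."""
    slo = Fraction(slo)
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1 - slo


def allowed_bad_events(slo, total_events):
    """Bad-event quota for a window that saw total_events events."""
    return error_budget(slo) * total_events


def budget_consumed(slo, total_events, bad_events):
    """Fraction of the window's quota already used."""
    allowed = allowed_bad_events(slo, total_events)
    if allowed == 0:
        raise ValueError("no events in the window")
    return Fraction(bad_events) / allowed


quota = allowed_bad_events("0.999", 1_000_000)   # 1000 bad events allowed
used = budget_consumed("0.999", 1_000_000, 250)  # one quarter of the budget
```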

Q&A

Q: Why not log everything and grep later? A: Cost and signal-to-noise. The strategy prioritises “surgical precision”: each signal must answer a question. Logging by volume inflates the bill and buries the signals an engineer needs to see.

Q: Where do business metrics like MTTR live? A: In the Business & SLO surface (band 4). They are derived from incident records and alert timestamps, not from raw application metrics.
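
As a minimal sketch of that derivation, MTTR can be computed as the mean gap between the alert timestamp and the recovery timestamp of each incident. The record shape (a pair of datetimes) and the sample incidents are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to recovery.

    incidents: iterable of (alert_fired_at, recovered_at) datetime pairs.
    """
    durations = [recovered - fired for fired, recovered in incidents]
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta()) / len(durations)


# Two hypothetical incidents: one resolved in 30 minutes, one in 90.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 15, 30)),
]
result = mttr(incidents)  # timedelta of 1 hour
```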

Q: Are CLI invocations traced? A: Yes. The strategy lists CLI commands as trace entry points alongside REST and gRPC. The CLI injects a trace context that propagates through every backend call it makes.
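
The source does not say which propagation format the CLI uses. Assuming the W3C Trace Context `traceparent` header (the default in OpenTelemetry), the sketch below shows how a CLI could mint a root context and how a downstream hop could keep the same trace ID while issuing a new span ID. The function names are hypothetical.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Create a root trace context, as a CLI entry point might."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent):
    """Derive the header for an outgoing backend call: same trace ID,
    fresh span ID, flags carried over."""
    match = TRACEPARENT_RE.match(parent)
    if not match:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


root = new_traceparent()
headers = {"traceparent": child_traceparent(root)}  # attach to the backend request
```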

Examples

A push-to-deploy crosses four spans: operation=detect (api-engine, ~200ms) → operation=generate (api-engine, ~50ms) → operation=build (executor, ~5min, tagged with executor_type=docker) → operation=deploy (deploy-loco, ~90s, tagged with target_env=prod, strategy=canary). One trace ID joins all four; Grafana stitches the waterfall.
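
The waterfall in this example can be approximated in a few lines. The sketch below lays the four spans end to end under one trace ID, using the approximate durations quoted above and assuming the spans run strictly one after another with no gaps. The `waterfall` helper and the `trace-123` ID are hypothetical.

```python
# (operation, service, approximate duration in ms) from the example above.
SPANS = [
    ("detect", "api-engine", 200),
    ("generate", "api-engine", 50),
    ("build", "executor", 5 * 60 * 1000),
    ("deploy", "deploy-loco", 90 * 1000),
]


def waterfall(spans, trace_id):
    """Assign sequential start/end offsets to spans sharing one trace ID."""
    rows, offset = [], 0
    for operation, service, duration_ms in spans:
        rows.append({
            "trace_id": trace_id,
            "operation": operation,
            "service": service,
            "start_ms": offset,
            "end_ms": offset + duration_ms,
        })
        offset += duration_ms
    return rows


rows = waterfall(SPANS, trace_id="trace-123")
total_ms = rows[-1]["end_ms"]  # 390250 ms, about 6.5 minutes end to end
```

Almost all of the end-to-end time sits in the build span, which is where a waterfall view would point an engineer first.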
