Storm Observability Pillars
sparki intermediate 6 min read
ELI5
Storm is the platform’s nervous system: it counts what happened (metrics), narrates what happened (logs), and draws maps of how requests moved (traces). It picks signals surgically — every metric, log, and trace must answer a specific operational question.
Technical Deep Dive
services/observability-storm/STORM_OBSERVABILITY_STRATEGY.md is the canonical document. The stack is Prometheus + StatsD for metrics, structured JSON logs, and distributed tracing across api-engine, deploy-loco, and auth-shield.
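As a minimal sketch of what "structured JSON logs" can look like in practice, the snippet below formats each log record as a single JSON object using only the Python standard library. The field names (`service`, `trace_id`, `level`, `message`) and the logger name are illustrative assumptions, not taken from the strategy document.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            # "service" is a hypothetical field passed through `extra`.
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        # Hypothetical correlation field so a log line can be joined to a trace.
        trace_id = getattr(record, "trace_id", None)
        if trace_id is not None:
            payload["trace_id"] = trace_id
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("storm-sketch")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("build finished", extra={"service": "api-engine", "trace_id": "abc123"})
line = buffer.getvalue().strip()
print(line)
```

Emitting one object per line keeps the output easy to parse for a log aggregator such as the Loki instance shown in the signal flow.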
Four Instrumentation Surfaces
| Surface | Examples |
|---|---|
| Application Layer | HTTP latency p50/p95/p99, error rates, status code distribution, goroutines, heap, GC pauses |
| Pipeline & Build | Stage timings (detect → generate → execute → deploy), job queue size, executor container startup, test pass/fail |
| Infrastructure | CPU, memory, disk I/O, container memory pressure, K8s pod restarts, DB connection saturation, replication lag |
| Business & SLO | Error-budget consumed, fast/slow burn rate, availability of detect→build→deploy, MTTR |
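To make the Application Layer row concrete, here is a small sketch that derives p50/p95/p99 from raw latency samples with the standard library. In a real deployment these percentiles would more likely come from Prometheus histograms; the function name and the sample data are assumptions for illustration only.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 for a list of latency samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points between the 100 percentile bands.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic samples: 1 ms through 1000 ms, one request each.
result = latency_percentiles(list(range(1, 1001)))
print(result)
```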
Trace Boundaries
| Operation | Start | End | Span Tags |
|---|---|---|---|
| Project Detection | API call received | Detection result stored | operation=detect, project_id, language_count, time_ms |
| Pipeline Generation | Detection complete | Pipeline config persisted | operation=generate, framework_count, stage_count |
| Build Execution | Job dequeued | Build log archived | operation=build, job_id, executor_type, status |
| Deployment | Deployment triggered | Target reached | operation=deploy, target_env, strategy, duration_ms |
| Chaos Test | Test injected | Chaos resolved | operation=chaos, test_type, failure_mode, recovery_time |
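The table above can be read as a contract: each operation has a set of tags a span must carry. One way to check that contract is sketched below. The tag names come from the table; the `REQUIRED_TAGS` mapping and `missing_tags` helper are hypothetical names, not part of the Storm codebase.

```python
# Required span tags per operation, transcribed from the trace boundary table.
REQUIRED_TAGS = {
    "detect": {"project_id", "language_count", "time_ms"},
    "generate": {"framework_count", "stage_count"},
    "build": {"job_id", "executor_type", "status"},
    "deploy": {"target_env", "strategy", "duration_ms"},
    "chaos": {"test_type", "failure_mode", "recovery_time"},
}


def missing_tags(tags):
    """Return the sorted list of required tags absent from a span's tag dict."""
    operation = tags.get("operation")
    if operation not in REQUIRED_TAGS:
        raise ValueError(f"unknown operation: {operation!r}")
    return sorted(REQUIRED_TAGS[operation] - set(tags))


complete = {"operation": "build", "job_id": "j-1", "executor_type": "docker", "status": "ok"}
incomplete = {"operation": "build", "job_id": "j-2", "executor_type": "docker"}

print(missing_tags(complete))
print(missing_tags(incomplete))
```

A check like this could run in a unit test or in a collector-side processor, so that spans with missing tags are caught before they reach dashboards.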
Signal Flow
flowchart LR
  subgraph Services
    AE[api-engine]
    DL[deploy-loco]
    AS[auth-shield]
  end
  AE -- Prometheus scrape --> PR[(Prometheus)]
  DL -- Prometheus scrape --> PR
  AS -- Prometheus scrape --> PR
  AE -- OTLP traces --> OT[(OTel Collector)]
  DL -- OTLP traces --> OT
  OT --> JG[(Jaeger/Tempo)]
  AE -- JSON logs --> LK[(Loki)]
  DL -- JSON logs --> LK
  AS -- JSON logs --> LK
  PR --> GR[Grafana dashboards]
  JG --> GR
  LK --> GR
  PR --> AM[Alertmanager]
  AM --> ON[on-call]
Error Budget Vocabulary
- budget consumed — fraction of the SLO window’s allowed errors used so far.
- fast burn — short-window high-rate error consumption (page now).
- slow burn — long-window slow drift toward exhausting the budget (ticket).
- MTTR — mean time to recovery, tracked as a business metric.
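The vocabulary above maps onto a few lines of arithmetic. The sketch below computes budget consumed and burn rate, then classifies the result as page, ticket, or ok. The thresholds (14.4 for the short window, 1.0 for the long window) are commonly cited illustrative values and are an assumption here; the strategy document may use different ones.

```python
def budget_consumed(bad_events, expected_window_events, slo):
    """Fraction of the window's allowed errors that has been used so far."""
    allowed = (1.0 - slo) * expected_window_events
    if allowed <= 0:
        raise ValueError("SLO leaves no error budget")
    return bad_events / allowed


def burn_rate(bad_events, total_events, slo):
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1.0 - slo)


def classify(short_window_rate, long_window_rate,
             fast_threshold=14.4, slow_threshold=1.0):
    """Fast burn pages immediately; slow burn opens a ticket (assumed thresholds)."""
    if short_window_rate >= fast_threshold:
        return "page"
    if long_window_rate >= slow_threshold:
        return "ticket"
    return "ok"


slo = 0.999
consumed = budget_consumed(bad_events=250, expected_window_events=1_000_000, slo=slo)
short = burn_rate(bad_events=20, total_events=1_000, slo=slo)      # short window
long_ = burn_rate(bad_events=150, total_events=100_000, slo=slo)   # long window
print(round(consumed, 3), round(short, 1), round(long_, 1), classify(short, long_))
```

A burn rate of 1.0 means the budget would be exhausted exactly at the end of the SLO window; anything above that exhausts it early.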
Key Terms
- instrumentation surface — one of the four bands above; the strategy explicitly avoids “everything is a metric”
- trace boundary — a defined start/end pair for a span, with mandatory tags
- error budget — 1 - SLO; the allowed bad-event quota in a rolling window
- burn rate — speed at which the budget is being consumed; multi-window alerting (fast & slow)
Q&A
Q: Why not log everything and grep later? A: Cost and signal-to-noise. The strategy prioritises “surgical precision”: each signal must answer a question. Logging everything inflates the bill and buries the signals an engineer actually needs.
Q: Where do business metrics like MTTR live? A: In the Business & SLO surface (band 4). They are derived from incident records and alert timestamps, not from raw application metrics.
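As a rough illustration of deriving MTTR from incident records and alert timestamps, the sketch below averages the time between an alert firing and the incident being resolved. The record shape (a pair of `datetime` values) and the function name are assumptions made for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved_at - alert_fired_at) over a list of datetime pairs.

    Returns None when there are no incidents to average.
    """
    durations = [resolved - fired for fired, resolved in incidents]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident records: (alert fired, incident resolved).
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
]
mttr = mean_time_to_recovery(incidents)
print(mttr)
```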
Q: Are CLI invocations traced? A: Yes — the strategy lists CLI commands as trace entry points alongside REST and gRPC. The CLI injects a trace context that survives across any backend calls it makes.
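The source does not say which propagation format the CLI uses. Assuming the W3C Trace Context `traceparent` header (the default propagator in OpenTelemetry), a CLI could start a trace and hand it to backend calls roughly as follows. The function names are hypothetical.

```python
import secrets


def new_traceparent():
    """Start a new trace: version 00, random trace ID and span ID, sampled flag 01."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent):
    """Keep the trace ID but mint a new span ID for an outgoing backend call."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


cli_context = new_traceparent()
# Headers the CLI would attach to a REST or gRPC request (assumed header name).
headers = {"traceparent": child_traceparent(cli_context)}
print(headers)
```

Because the trace ID is preserved while the span ID changes, every backend span started from these headers lands in the same trace as the CLI invocation.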
Examples
A push-to-deploy crosses four spans: operation=detect (api-engine, ~200ms) → operation=generate (api-engine, ~50ms) → operation=build (executor, ~5min, tagged with executor_type=docker) → operation=deploy (deploy-loco, ~90s, tagged with target_env=prod, strategy=canary). One trace ID joins all four; Grafana stitches the waterfall.
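The example above can be laid out as a simple waterfall. The sketch below uses the approximate durations from the paragraph and assumes the four spans run strictly one after another, which is a simplification; the `Span` class, the trace ID value, and the `waterfall` helper are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    operation: str
    service: str
    duration_ms: int


def waterfall(spans):
    """Return (operation, start_ms, end_ms) rows for sequential spans of one trace."""
    if len({span.trace_id for span in spans}) != 1:
        raise ValueError("all spans must share one trace ID")
    rows, offset = [], 0
    for span in spans:
        rows.append((span.operation, offset, offset + span.duration_ms))
        offset += span.duration_ms
    return rows


TRACE_ID = "trace-1"  # placeholder value
spans = [
    Span(TRACE_ID, "detect", "api-engine", 200),
    Span(TRACE_ID, "generate", "api-engine", 50),
    Span(TRACE_ID, "build", "executor", 5 * 60 * 1000),
    Span(TRACE_ID, "deploy", "deploy-loco", 90 * 1000),
]
rows = waterfall(spans)
for operation, start_ms, end_ms in rows:
    print(f"{operation:<9} {start_ms:>7} ms -> {end_ms:>7} ms")
```

With these figures the build span dominates the trace, which is the kind of observation a waterfall view is meant to surface.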
Neighbors on the Map
- FNP Observability & Prometheus Metrics: monitoring FNP systems
- Docker Compose & Observability Stack: deploying iris-service locally or in production
- OpenTelemetry Instrumentation & Metrics: adding observability to iris-service code