CRUMB a card from devarno-cloud

Storm Observability Pillars

sparki intermediate 6 min read

ELI5

Storm is the platform’s nervous system: it counts what happened (metrics), narrates what happened (logs), and draws maps of how requests moved (traces). It picks signals surgically — every metric, log, and trace must answer a specific operational question.

Technical Deep Dive

services/observability-storm/STORM_OBSERVABILITY_STRATEGY.md is the canonical document. The stack is Prometheus + StatsD for metrics, structured JSON logs, and distributed tracing across api-engine, deploy-loco, and auth-shield.
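To make "structured JSON logs" concrete, here is a minimal sketch using only Python's standard `logging` module. The field names (`service`, `trace_id`), the logger name, and the formatter class are illustrative assumptions; the strategy document may define a different schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (hypothetical schema)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Assumed fields: they let a log line be filtered by service
            # and joined to the trace it belongs to.
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("storm-sketch")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("detection stored", extra={"service": "api-engine", "trace_id": "abc123"})
log_line = json.loads(buffer.getvalue())
```

Because every line is a parseable object, a log store can index `service` and `trace_id` directly instead of relying on free-text search.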

Four Instrumentation Surfaces

| Surface | Examples |
| --- | --- |
| Application Layer | HTTP latency p50/p95/p99, error rates, status code distribution, goroutines, heap, GC pauses |
| Pipeline & Build | Stage timings (detect → generate → execute → deploy), job queue size, executor container startup, test pass/fail |
| Infrastructure | CPU, memory, disk I/O, container memory pressure, K8s pod restarts, DB connection saturation, replication lag |
| Business & SLO | Error-budget consumed, fast/slow burn rate, availability of detect→build→deploy, MTTR |
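As an illustration of the p50/p95/p99 latency figures in the Application Layer row, the sketch below computes exact percentiles over raw samples with the standard library. This is an assumption made for clarity: a Prometheus-based stack would normally estimate quantiles from histogram buckets rather than keep raw samples, and the function name is hypothetical.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50/p95/p99 for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic data: one request at each latency from 1 ms to 101 ms.
samples = list(range(1, 102))
percentiles = latency_percentiles(samples)
```

Reporting three percentiles rather than a mean keeps tail latency visible, since a handful of slow requests barely moves an average.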

Trace Boundaries

| Operation | Start | End | Span Tags |
| --- | --- | --- | --- |
| Project Detection | API call received | Detection result stored | operation=detect, project_id, language_count, time_ms |
| Pipeline Generation | Detection complete | Pipeline config persisted | operation=generate, framework_count, stage_count |
| Build Execution | Job dequeued | Build log archived | operation=build, job_id, executor_type, status |
| Deployment | Deployment triggered | Target reached | operation=deploy, target_env, strategy, duration_ms |
| Chaos Test | Test injected | Chaos resolved | operation=chaos, test_type, failure_mode, recovery_time |
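Since each trace boundary carries mandatory tags, a span can be checked before it is exported. The sketch below encodes the tag sets from the table above; the `REQUIRED_TAGS` name and the `missing_tags` helper are hypothetical, not part of Storm.

```python
# Mandatory tags per operation, copied from the trace boundary table.
REQUIRED_TAGS = {
    "detect": {"operation", "project_id", "language_count", "time_ms"},
    "generate": {"operation", "framework_count", "stage_count"},
    "build": {"operation", "job_id", "executor_type", "status"},
    "deploy": {"operation", "target_env", "strategy", "duration_ms"},
    "chaos": {"operation", "test_type", "failure_mode", "recovery_time"},
}


def missing_tags(tags: dict) -> set:
    """Return the mandatory tags that are absent from a span's tag dict."""
    operation = tags.get("operation")
    if operation not in REQUIRED_TAGS:
        raise ValueError(f"unknown operation: {operation!r}")
    return REQUIRED_TAGS[operation] - set(tags)


incomplete_build = {"operation": "build", "job_id": "job-1"}
complete_deploy = {
    "operation": "deploy",
    "target_env": "prod",
    "strategy": "canary",
    "duration_ms": 90000,
}
```

A check like this could run in a unit test or in an exporter hook, so that a span with missing tags is caught before it reaches the trace store.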

Signal Flow

flowchart LR
subgraph Services
AE[api-engine]
DL[deploy-loco]
AS[auth-shield]
end
AE -- Prometheus scrape --> PR[(Prometheus)]
DL -- Prometheus scrape --> PR
AS -- Prometheus scrape --> PR
AE -- OTLP traces --> OT[(OTel Collector)]
DL -- OTLP traces --> OT
OT --> JG[(Jaeger/Tempo)]
AE -- JSON logs --> LK[(Loki)]
DL -- JSON logs --> LK
AS -- JSON logs --> LK
PR --> GR[Grafana dashboards]
JG --> GR
LK --> GR
PR --> AM[Alertmanager]
AM --> ON[on-call]

Error Budget Vocabulary

  • budget consumed — fraction of the SLO window’s allowed errors used so far.
  • fast burn — short-window high-rate error consumption (page now).
  • slow burn — long-window slow drift toward exhausting the budget (ticket).
  • MTTR — mean time to recovery, tracked as a business metric.
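The vocabulary above can be sketched as arithmetic. Burn rate is taken here as the observed error ratio divided by the error budget (1 - SLO), which is the usual definition. The thresholds (14.4 for paging, 1.0 for ticketing) are commonly cited defaults used as assumptions; the source does not state Storm's actual values, and all function names are hypothetical.

```python
def error_budget(slo: float) -> float:
    """Allowed fraction of bad events, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / error_budget(slo)


def budget_consumed(bad_events: int, total_events: int, slo: float) -> float:
    """Fraction of the window's allowed bad events already used."""
    allowed = error_budget(slo) * total_events
    return bad_events / allowed


def classify(short_window_ratio: float, long_window_ratio: float, slo: float,
             fast_threshold: float = 14.4, slow_threshold: float = 1.0) -> str:
    """Map short- and long-window error ratios to an action (assumed thresholds)."""
    if burn_rate(short_window_ratio, slo) >= fast_threshold:
        return "page"      # fast burn: page now
    if burn_rate(long_window_ratio, slo) >= slow_threshold:
        return "ticket"    # slow burn: open a ticket
    return "ok"
```

For a 99.9% SLO, a 2% error ratio in the short window is a burn rate of about 20, so `classify(0.02, 0.001, 0.999)` pages, while a 0.2% error ratio sustained over the long window burns at about 2x and opens a ticket.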

Key Terms

  • instrumentation surface — one of the four bands above; the strategy explicitly avoids “everything is a metric”
  • trace boundary — a defined start/end pair for a span, with mandatory tags
  • error budget = 1 - SLO; the allowed bad-event quota in a rolling window
  • burn rate — speed at which the budget is being consumed; multi-window alerting (fast & slow)

Q&A

Q: Why not log everything and grep later? A: Cost and signal-to-noise. The strategy prioritises “surgical precision”: each signal must answer a question. Logging everything inflates the storage bill and buries the lines an engineer actually needs.

Q: Where do business metrics like MTTR live? A: In the Business & SLO surface (band 4). They are derived from incident records and alert timestamps, not from raw application metrics.
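As a rough sketch of that derivation, MTTR can be computed as the mean gap between when an alert fired and when the incident was resolved. The record shape (a pair of timestamps) and the function name are assumptions; real incident records will carry more fields.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved - alerted) over (alerted_at, resolved_at) pairs."""
    durations = [resolved - alerted for alerted, resolved in incidents]
    if not durations:
        raise ValueError("no incidents in window")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident records: one 30-minute and one 90-minute outage.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 1, 9, 22, 0), datetime(2024, 1, 9, 23, 30)),
]
mttr = mean_time_to_recovery(incidents)
```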

Q: Are CLI invocations traced? A: Yes — the strategy lists CLI commands as trace entry points alongside REST and gRPC. The CLI injects a trace context that survives across any backend calls it makes.
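One way a trace context can survive across calls is the W3C Trace Context `traceparent` header, which has the form `00-<32 hex trace id>-<16 hex span id>-<2 hex flags>`. The sketch below assumes that format; the source does not say which propagation format the CLI uses, and the helper names are hypothetical.

```python
import re
import secrets

# version 00, 16-byte trace id, 8-byte parent span id, 1-byte flags (all lowercase hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace, as a CLI entry point might (flags 01 = sampled)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent: str) -> str:
    """Keep the trace id and flags, but mint a fresh span id for the next hop."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


cli_header = new_traceparent()
backend_header = child_traceparent(cli_header)
```

The trace id is the only part that stays constant across hops, which is what lets the CLI invocation and every backend call it triggers appear in one trace.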

Examples

A push-to-deploy crosses four spans: operation=detect (api-engine, ~200ms) → operation=generate (api-engine, ~50ms) → operation=build (executor, ~5min, tagged with executor_type=docker) → operation=deploy (deploy-loco, ~90s, tagged with target_env=prod, strategy=canary). One trace ID joins all four; Grafana stitches the waterfall.
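The same example can be modelled as data. This sketch assumes the four spans run strictly one after another and uses the approximate durations quoted above; the `Span` type, the trace id value, and the `waterfall` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    operation: str
    service: str
    duration_s: float


def waterfall(spans, trace_id):
    """Lay out one trace's spans end to end as (operation, service, start, end) rows."""
    offset = 0.0
    rows = []
    for span in spans:
        if span.trace_id != trace_id:
            continue  # span belongs to a different trace
        rows.append((span.operation, span.service, offset, offset + span.duration_s))
        offset += span.duration_s
    return rows


spans = [
    Span("t-1", "detect", "api-engine", 0.2),
    Span("t-1", "generate", "api-engine", 0.05),
    Span("t-1", "build", "executor", 300.0),
    Span("t-1", "deploy", "deploy-loco", 90.0),
    Span("t-2", "detect", "api-engine", 0.3),  # unrelated trace, filtered out
]
rows = waterfall(spans, "t-1")
total_seconds = rows[-1][3]
```

Under these assumptions the push-to-deploy takes about 390 seconds end to end, almost all of it in the build span.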
