Observability Stack
ELI5
Observability is the building’s nervous system: every service writes a journal entry (logs), every request hits a counter on the wall (metrics), and a coloured thread is tied to each request as it walks through the building (traces). Grafana is the security room with monitors showing all three.
Technical Deep Dive
infra/observability/docker-compose.yml brings up Grafana + Prometheus + Alertmanager with the bundled dashboards and rules. The plan in docs/docs/architecture/OBSERVABILITY_PLAN.md defines what every service must emit.
Telemetry Stack
    flowchart LR
      subgraph App[each service]
        Z[Zap structured logs]
        P[Prometheus metrics endpoint]
        O[OpenTelemetry tracer]
      end
      Z -->|JSON, ISO8601| LogStore[(log store / archive)]
      P -->|/metrics scrape| Prom[(Prometheus)]
      O -->|traceparent| Collector[OTel collector] --> Tracer[(trace backend)]
      Prom --> Graf[Grafana]
      Tracer --> Graf
      Prom --> AM[Alertmanager]

Logging
Per OBSERVABILITY_PLAN.md:
- Format: JSON
- Required fields: timestamp, level, service, trace_id, message
- Levels: DEBUG (dev only), INFO, WARN, ERROR, FATAL
- Retention: INFO 7 d; WARN/ERROR/FATAL 30 d; archives 90 d
The Zap wrapper in libs/shared-go/pkg/logging enforces the field set with WithTraceID, WithUserID, WithService helpers.
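The required field set can be illustrated without the shared library. The sketch below is only a stand-in: it uses the standard library's log/slog (Go 1.21+) rather than Zap, and NewServiceLogger, MissingFields, and EmitSample are hypothetical names, not the helpers in libs/shared-go/pkg/logging. It shows one way to produce and check a JSON record carrying timestamp, level, service, trace_id, and message.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
)

// requiredFields is the field set listed in the plan.
var requiredFields = []string{"timestamp", "level", "service", "trace_id", "message"}

// NewServiceLogger builds a JSON logger that renames slog's default "time"
// and "msg" keys and stamps the service name on every record.
// Hypothetical helper: a stand-in for the Zap wrapper, not its real API.
func NewServiceLogger(w io.Writer, service string) *slog.Logger {
	opts := &slog.HandlerOptions{
		Level: slog.LevelInfo, // DEBUG is dev-only in the plan
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 {
				switch a.Key {
				case slog.TimeKey:
					a.Key = "timestamp"
				case slog.MessageKey:
					a.Key = "message"
				}
			}
			return a
		},
	}
	return slog.New(slog.NewJSONHandler(w, opts)).With("service", service)
}

// MissingFields reports which required fields a JSON log line lacks.
func MissingFields(line []byte) []string {
	var rec map[string]any
	if err := json.Unmarshal(line, &rec); err != nil {
		return requiredFields
	}
	var missing []string
	for _, f := range requiredFields {
		if _, ok := rec[f]; !ok {
			missing = append(missing, f)
		}
	}
	return missing
}

// EmitSample logs one INFO record into a buffer and returns the raw line.
func EmitSample(service, traceID, msg string) []byte {
	var buf bytes.Buffer
	NewServiceLogger(&buf, service).With("trace_id", traceID).Info(msg)
	return buf.Bytes()
}

func main() {
	line := EmitSample("knowledge-service", "4bf92f3577b34da6a3ce929d0e0e4736", "answer accepted")
	fmt.Print(string(line))
	fmt.Println("missing fields:", MissingFields(line))
}
```

A check like MissingFields could run in a unit test against captured log output, so a handler that forgets the trace ID fails the build rather than surfacing in production.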
Metrics
Two metric families ship from libs/shared-go/pkg/metrics: request metrics (the first four rows below) and event metrics (the last four):
| Metric | Type | Labels |
|---|---|---|
| {service}_requests_total | CounterVec | method, endpoint, status |
| {service}_request_duration_seconds | HistogramVec | method, endpoint |
| {service}_requests_in_flight | Gauge | - |
| {service}_errors_total | CounterVec | type |
| {service}_events_published_total | CounterVec | event_type |
| {service}_events_consumed_total | CounterVec | event_type |
| {service}_event_latency_seconds | HistogramVec | event_type |
| {service}_event_errors_total | CounterVec | event_type, error_type |
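To make the Labels column concrete: each distinct combination of label values becomes its own time series. The sketch below uses a toy, stdlib-only counter (toyCounterVec is a hypothetical type, not the Prometheus client's CounterVec) to show how gateway_requests_total with method, endpoint, and status labels would fan out into separate lines on /metrics.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// toyCounterVec is a hypothetical, stdlib-only stand-in for a labelled
// counter. It is NOT the client_golang type.
type toyCounterVec struct {
	name   string
	labels []string
	mu     sync.Mutex
	series map[string]float64 // rendered label set -> count
}

func newToyCounterVec(name string, labels ...string) *toyCounterVec {
	return &toyCounterVec{name: name, labels: labels, series: map[string]float64{}}
}

// Inc adds one to the series identified by the given label values, which
// must be supplied in the same order as the label names.
func (c *toyCounterVec) Inc(values ...string) error {
	if len(values) != len(c.labels) {
		return fmt.Errorf("%s: want %d label values, got %d", c.name, len(c.labels), len(values))
	}
	pairs := make([]string, len(values))
	for i, v := range values {
		pairs[i] = fmt.Sprintf("%s=%q", c.labels[i], v)
	}
	key := strings.Join(pairs, ",")
	c.mu.Lock()
	defer c.mu.Unlock()
	c.series[key]++
	return nil
}

// Lines renders each series roughly as it would appear on /metrics,
// sorted so the output is stable.
func (c *toyCounterVec) Lines() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]string, 0, len(c.series))
	for key, n := range c.series {
		out = append(out, fmt.Sprintf("%s{%s} %g", c.name, key, n))
	}
	sort.Strings(out)
	return out
}

func main() {
	requests := newToyCounterVec("gateway_requests_total", "method", "endpoint", "status")
	requests.Inc("GET", "/answers", "200")
	requests.Inc("GET", "/answers", "200")
	requests.Inc("POST", "/answers", "500")
	for _, l := range requests.Lines() {
		fmt.Println(l)
	}
}
```

Because every new label value creates another series, labels such as endpoint are best kept to route templates rather than raw paths containing IDs.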
Business metrics defined in the plan (additional, not framework-shipped):
| Metric | Labels |
|---|---|
| gamification_xp_earned_total | source |
| payment_subscriptions_total | tier, status |
| knowledge_content_created_total | type |
Tracing
Tracing uses OpenTelemetry with W3C traceparent propagation across HTTP and event-bus boundaries. Production sampling is 1% head-based; the catalog calls this out explicitly, because anything higher is too much for the expected volumes. Each span is tagged with service, endpoint, and user_id (when known). The span's trace_id matches the trace_id field in the structured logs, so a click from a log line to its trace lines up.
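Head-based sampling means the keep-or-drop decision is made once, when the trace starts, and can be derived from the trace ID alone. The sketch below shows the general idea of a ratio sampler in plain Go. ShouldSample is a hypothetical function written for illustration; a real service would normally configure the sampler provided by its OpenTelemetry SDK rather than hand-roll one.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// ShouldSample makes a deterministic head-based decision from the trace ID,
// so any process that evaluates the same ID at the same ratio agrees.
// Hypothetical helper for illustration only.
func ShouldSample(traceIDHex string, ratio float64) (bool, error) {
	raw, err := hex.DecodeString(traceIDHex)
	if err != nil || len(raw) != 16 {
		return false, fmt.Errorf("trace id must be 32 hex characters")
	}
	if ratio >= 1 {
		return true, nil
	}
	if ratio <= 0 {
		return false, nil
	}
	// Treat the low 8 bytes as a pseudo-random 63-bit number and keep the
	// trace when it falls below ratio * 2^63.
	const twoPow63 = 1 << 63
	x := binary.BigEndian.Uint64(raw[8:16]) >> 1
	threshold := uint64(ratio * twoPow63)
	return x < threshold, nil
}

func main() {
	for _, id := range []string{
		"4bf92f3577b34da6a3ce929d0e0e4736",
		"4bf92f3577b34da60000000000000001",
	} {
		keep, err := ShouldSample(id, 0.01) // 1% as in production
		fmt.Println(id, keep, err)
	}
}
```

Downstream services do not re-decide: they honour the sampled flag carried in the traceparent header, which is what keeps a trace either fully exported or fully dropped.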
Alerts
Alerting rules in infra/observability/ (evaluated by Prometheus and routed through Alertmanager) should at minimum cover:
- request error rate per service > threshold (5xx / total)
- request p95 latency above SLO
- event consumer lag (pending_entries from Redis Streams)
- worker-pool stream length growth rate
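For the p95 latency alert, it helps to see how a percentile is estimated from cumulative histogram buckets. The sketch below interpolates linearly inside the bucket that contains the target rank, which is the same general approach PromQL's histogram_quantile takes. The bucket values and the 0.5 s SLO are made-up illustration data, not figures from the plan.

```go
package main

import (
	"fmt"
	"math"
)

// Bucket is one cumulative histogram bucket: Count observations were
// less than or equal to UpperBound.
type Bucket struct {
	UpperBound float64 // seconds; the last bucket uses +Inf
	Count      float64 // cumulative count
}

// Quantile estimates the q-quantile by linear interpolation inside the
// bucket containing the target rank. Buckets must be sorted by UpperBound
// and end with a +Inf bucket.
func Quantile(q float64, buckets []Bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.Count < rank {
			continue
		}
		lowerBound, lowerCount := 0.0, 0.0
		if i > 0 {
			lowerBound, lowerCount = buckets[i-1].UpperBound, buckets[i-1].Count
		}
		// In the overflow bucket the best available answer is the highest
		// finite bound; an empty bucket has nothing to interpolate.
		if math.IsInf(b.UpperBound, 1) || b.Count == lowerCount {
			return lowerBound
		}
		return lowerBound + (b.UpperBound-lowerBound)*(rank-lowerCount)/(b.Count-lowerCount)
	}
	return math.NaN()
}

// sampleBuckets returns hypothetical request_duration_seconds data.
func sampleBuckets() []Bucket {
	return []Bucket{{0.1, 50}, {0.25, 80}, {0.5, 94}, {1, 99}, {math.Inf(1), 100}}
}

func main() {
	const slo = 0.5 // seconds; an assumed SLO, not taken from the plan
	p95 := Quantile(0.95, sampleBuckets())
	fmt.Printf("p95=%.3fs breach=%v\n", p95, p95 > slo)
}
```

The estimate is only as fine as the bucket boundaries around the SLO, so a bucket edge at or near the SLO value makes the alert noticeably more trustworthy.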
Required Per-Service Wiring
A new service must, at minimum:
- Register ServiceMetrics and expose /metrics.
- Use Logger.WithService(name).WithTraceID(...) for every request log.
- Inject traceparent headers when calling other services and when publishing events (set Metadata["traceparent"] on the event envelope).
Skipping any of those breaks the cross-service trace stitching.
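The third requirement is the one most often missed, so here is a minimal sketch of both injection points. The Event and SpanContext types, the helper names, and the example URL are hypothetical; only the Metadata["traceparent"] key and the W3C header layout (version-traceid-parentid-flags) come from the text and the spec. In practice the OpenTelemetry SDK's propagator would normally write these values.

```go
package main

import (
	"fmt"
	"net/http"
)

// Event is a hypothetical envelope; only the Metadata map is taken from
// the text above.
type Event struct {
	Type     string
	Metadata map[string]string
}

// SpanContext holds the identifiers that travel in a traceparent header.
type SpanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool
}

// Traceparent renders the W3C header value: version-traceid-parentid-flags.
func (sc SpanContext) Traceparent() string {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", sc.TraceID, sc.SpanID, flags)
}

// InjectHTTP sets the header on an outgoing service-to-service request.
func InjectHTTP(req *http.Request, sc SpanContext) {
	req.Header.Set("traceparent", sc.Traceparent())
}

// InjectEvent copies the active span context onto the envelope before
// it is published.
func InjectEvent(ev *Event, sc SpanContext) {
	if ev.Metadata == nil {
		ev.Metadata = make(map[string]string)
	}
	ev.Metadata["traceparent"] = sc.Traceparent()
}

func main() {
	sc := SpanContext{TraceID: "4bf92f3577b34da6a3ce929d0e0e4736", SpanID: "00f067aa0ba902b7", Sampled: true}

	// Hypothetical internal URL; no request is actually sent.
	req, err := http.NewRequest(http.MethodGet, "http://knowledge-service.internal/answers", nil)
	if err != nil {
		panic(err)
	}
	InjectHTTP(req, sc)
	fmt.Println("http :", req.Header.Get("traceparent"))

	ev := &Event{Type: "answer.accepted"}
	InjectEvent(ev, sc)
	fmt.Println("event:", ev.Metadata["traceparent"])
}
```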
Key Terms
- Trace → an end-to-end DAG of spans, identified by trace_id.
- Span → one unit of work within a trace; has service, endpoint, duration.
- traceparent → W3C header carrying the parent span context across boundaries.
- Sampling rate → percentage of traces actually exported (1% in prod).
- Pending entries → Redis Streams’ un-ACKed backlog; the canonical “consumer lag” signal.
Q&A
Q: A request flows gateway → knowledge-service → event publish → gamification-service. The gamification log row has no trace_id. What broke?
A: The event envelope’s Metadata["traceparent"] was not set when published; the consumer rebuilt a fresh trace context. Fix the publisher to copy the active span context into Metadata before Publish.
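The consumer side of that failure can be sketched as follows. ParseTraceparent and TraceIDFromMetadata are hypothetical helpers that validate a version-00 W3C traceparent value and report when the publisher never set it. That missing-header case is where a consumer would otherwise start a fresh trace and lose the link to the upstream request.

```go
package main

import (
	"fmt"
	"strings"
)

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// ParseTraceparent splits a version-00 W3C traceparent value into its
// trace ID and parent span ID. Hypothetical helper for illustration.
func ParseTraceparent(v string) (traceID, parentID string, ok bool) {
	parts := strings.Split(v, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return "", "", false
	}
	traceID, parentID = parts[1], parts[2]
	if !isLowerHex(traceID, 32) || !isLowerHex(parentID, 16) || !isLowerHex(parts[3], 2) {
		return "", "", false
	}
	// All-zero IDs are invalid under the W3C Trace Context spec.
	if strings.Trim(traceID, "0") == "" || strings.Trim(parentID, "0") == "" {
		return "", "", false
	}
	return traceID, parentID, true
}

// TraceIDFromMetadata returns the upstream trace ID carried on an event,
// or ok=false when Metadata["traceparent"] is absent or malformed.
func TraceIDFromMetadata(md map[string]string) (string, bool) {
	traceID, _, ok := ParseTraceparent(md["traceparent"])
	return traceID, ok
}

func main() {
	good := map[string]string{
		"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
	}
	fmt.Println(TraceIDFromMetadata(good))
	fmt.Println(TraceIDFromMetadata(nil)) // publisher forgot the header
}
```

Logging a warning whenever ok is false on the consumer makes a missing header visible at the point of failure instead of surfacing later as an orphaned log row.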
Q: Why is there a separate errors_total when requests_total aggregated by 5xx status already captures failures?
A: errors_total labels by Go error category (network, validation, internal, …) for failure-mode analysis. HTTP status alone collapses too many root causes into 500.
Q: A new histogram for “DB query duration” — what bucket bounds?
A: Match the framework’s existing request_duration_seconds buckets unless you have evidence the distribution differs. Cross-metric comparability is worth more than locally-tuned buckets.
Examples
Adding a custom counter for accepted answers:
    // import "github.com/prometheus/client_golang/prometheus"

    // Counter for answers marked as accepted.
    acceptedAnswers := prometheus.NewCounter(prometheus.CounterOpts{
        Name: "knowledge_answer_accepted_total",
        Help: "Number of answers marked accepted",
    })
    prometheus.MustRegister(acceptedAnswers)

    // in handler:
    acceptedAnswers.Inc()