Observability Stack

ELI5

Observability is the building’s nervous system: every service writes a journal entry (logs), every request hits a counter on the wall (metrics), and a coloured thread is tied to each request as it walks through the building (traces). Grafana is the security room with monitors showing all three.

Technical Deep Dive

infra/observability/docker-compose.yml brings up Grafana + Prometheus + Alertmanager with the bundled dashboards and rules. The plan in docs/docs/architecture/OBSERVABILITY_PLAN.md defines what every service must emit.

Telemetry Stack

flowchart LR
    subgraph App[each service]
      Z[Zap structured logs]
      P[Prometheus metrics endpoint]
      O[OpenTelemetry tracer]
    end
    Z -->|JSON, ISO8601| LogStore[(log store / archive)]
    P -->|/metrics scrape| Prom[(Prometheus)]
    O -->|traceparent| Collector[OTel collector] --> Tracer[(trace backend)]
    Prom --> Graf[Grafana]
    Tracer --> Graf
    Prom --> AM[Alertmanager]

Logging

Per OBSERVABILITY_PLAN.md:

Format: JSON
Required fields: timestamp, level, service, trace_id, message
Levels: DEBUG (dev only), INFO, WARN, ERROR, FATAL
Retention: INFO 7 d; WARN/ERROR/FATAL 30 d; archives 90 d

The Zap wrapper in libs/shared-go/pkg/logging enforces the field set with WithTraceID, WithUserID, WithService helpers.
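The field contract above can be sketched without any dependencies. The following is a minimal illustration, not the actual wrapper: the `Logger`, `Entry`, `Line`, and `Decode` names are hypothetical, and the standard library's `encoding/json` stands in for Zap so the example runs on its own. Only the required field names and the `WithService` / `WithTraceID` helper names come from the text.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// Entry mirrors the required field set from the plan.
type Entry struct {
	Timestamp string `json:"timestamp"`
	Level     string `json:"level"`
	Service   string `json:"service"`
	TraceID   string `json:"trace_id"`
	Message   string `json:"message"`
}

// Logger is a hypothetical, dependency-free stand-in for the shared Zap wrapper.
type Logger struct {
	service string
	traceID string
}

// WithService returns a copy of the logger bound to a service name.
func (l Logger) WithService(name string) Logger {
	l.service = name
	return l
}

// WithTraceID returns a copy of the logger bound to a trace ID.
func (l Logger) WithTraceID(id string) Logger {
	l.traceID = id
	return l
}

// Line renders one JSON log line and refuses to emit an entry that lacks a required field.
func (l Logger) Line(level, msg string) (string, error) {
	if l.service == "" || l.traceID == "" || level == "" || msg == "" {
		return "", errors.New("missing required log field")
	}
	b, err := json.Marshal(Entry{
		Timestamp: time.Now().UTC().Format(time.RFC3339), // ISO 8601 timestamp
		Level:     level,
		Service:   l.service,
		TraceID:   l.traceID,
		Message:   msg,
	})
	return string(b), err
}

// Decode parses a log line back into an Entry, which is handy in tests.
func Decode(line string) (Entry, error) {
	var e Entry
	err := json.Unmarshal([]byte(line), &e)
	return e, err
}

func main() {
	log := Logger{}.WithService("knowledge").WithTraceID("4bf92f3577b34da6a3ce929d0e0e4736")
	line, err := log.Line("INFO", "answer accepted")
	fmt.Println(line, err)
}
```

Rejecting incomplete entries at the wrapper is one way to "enforce" the field set; the real package may enforce it differently.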

Metrics

Two groups of metrics ship from libs/shared-go/pkg/metrics, one for requests and one for events:

Metric	Type	Labels
`{service}_requests_total`	CounterVec	`method`, `endpoint`, `status`
`{service}_request_duration_seconds`	HistogramVec	`method`, `endpoint`
`{service}_requests_in_flight`	Gauge	—
`{service}_errors_total`	CounterVec	`type`
`{service}_events_published_total`	CounterVec	`event_type`
`{service}_events_consumed_total`	CounterVec	`event_type`
`{service}_event_latency_seconds`	HistogramVec	`event_type`
`{service}_event_errors_total`	CounterVec	`event_type`, `error_type`
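To make the request labels concrete, here is a dependency-free sketch of what HTTP middleware has to capture to feed `{service}_requests_total` and `{service}_requests_in_flight`. It is an assumption-laden illustration: `Recorder`, `Labels`, `Middleware`, and `Demo` are hypothetical names, an in-memory map stands in for the real CounterVec and Gauge, and the `/answers/{id}` route is invented for the demo.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// Labels mirrors the label set of {service}_requests_total.
type Labels struct {
	Method   string
	Endpoint string
	Status   int
}

// Recorder is a hypothetical in-memory stand-in for the CounterVec and Gauge.
type Recorder struct {
	mu       sync.Mutex
	requests map[Labels]int
	inFlight int
}

// NewRecorder returns an empty recorder.
func NewRecorder() *Recorder { return &Recorder{requests: make(map[Labels]int)} }

// Count returns how many requests were recorded for one label combination.
func (r *Recorder) Count(l Labels) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.requests[l]
}

// statusWriter remembers the status code written by the wrapped handler.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// Middleware counts each request under the route template passed as endpoint.
func (r *Recorder) Middleware(endpoint string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		r.mu.Lock()
		r.inFlight++ // in-flight gauge goes up on entry
		r.mu.Unlock()

		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		defer func() {
			r.mu.Lock()
			r.inFlight-- // and back down on exit, even if the handler panics
			r.requests[Labels{req.Method, endpoint, sw.status}]++
			r.mu.Unlock()
		}()
		next.ServeHTTP(sw, req)
	})
}

// Demo drives two requests through the middleware and returns the recorder.
func Demo() *Recorder {
	rec := NewRecorder()
	h := rec.Middleware("/answers/{id}", http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if req.URL.Path == "/answers/missing" {
			http.Error(w, "not found", http.StatusNotFound)
			return
		}
		fmt.Fprintln(w, "ok")
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/answers/42", nil))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/answers/missing", nil))
	return rec
}

func main() {
	rec := Demo()
	fmt.Println(rec.Count(Labels{"GET", "/answers/{id}", 200}), rec.Count(Labels{"GET", "/answers/{id}", 404}))
}
```

The sketch labels by route template rather than raw path, which keeps label cardinality bounded. Whether the shared package does the same is not stated here.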

Business metrics defined in the plan (additional, not framework-shipped):

Metric	Labels
`gamification_xp_earned_total`	`source`
`payment_subscriptions_total`	`tier`, `status`
`knowledge_content_created_total`	`type`

Tracing

OpenTelemetry with W3C traceparent propagation across HTTP and event-bus boundaries. Production sampling is 1% head-based; the catalog calls this out, since anything more is too much for the expected volumes. Each span is tagged with service, endpoint, and user_id (when known). The span's trace_id matches the structured log field, so a click from logs to traces lines up.
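For readers who have not looked inside a traceparent header, the sketch below parses and formats the version-00 layout (`00-<trace-id>-<parent-id>-<trace-flags>`) and shows one way a 1% head-based decision can be derived from the trace ID. The `SpanContext`, `Parse`, and `ShouldSample` names are hypothetical, and the sampling function is a simplified illustration, not the OpenTelemetry SDK's sampler.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

const hexDigits = "0123456789abcdef"

// SpanContext holds the fields carried by a version-00 traceparent header.
type SpanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool   // lowest bit of the trace-flags byte
}

// Format renders "00-<trace-id>-<parent-id>-<trace-flags>".
func (sc SpanContext) Format() string {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	return "00-" + sc.TraceID + "-" + sc.SpanID + "-" + flags
}

// validID reports whether s is n lowercase hex characters and not all zeros.
func validID(s string, n int) bool {
	if len(s) != n || strings.Trim(s, "0") == "" {
		return false
	}
	return strings.Trim(s, hexDigits) == ""
}

// Parse validates and decodes a version-00 traceparent header.
func Parse(header string) (SpanContext, error) {
	p := strings.Split(header, "-")
	if len(p) != 4 || p[0] != "00" || !validID(p[1], 32) || !validID(p[2], 16) {
		return SpanContext{}, errors.New("malformed traceparent")
	}
	if len(p[3]) != 2 || strings.Trim(p[3], hexDigits) != "" {
		return SpanContext{}, errors.New("malformed trace-flags")
	}
	flags, err := hex.DecodeString(p[3])
	if err != nil {
		return SpanContext{}, err
	}
	return SpanContext{TraceID: p[1], SpanID: p[2], Sampled: flags[0]&1 == 1}, nil
}

// ShouldSample makes a head-based decision for a new root trace. It is
// deterministic in the trace ID; downstream services then follow the
// sampled flag carried in traceparent instead of deciding again.
func ShouldSample(traceID string, ratio float64) bool {
	raw, err := hex.DecodeString(traceID)
	if err != nil || len(raw) != 16 || ratio <= 0 {
		return false
	}
	if ratio >= 1 {
		return true
	}
	x := binary.BigEndian.Uint64(raw[8:]) >> 1 // 63-bit value from the low half of the ID
	return float64(x) < ratio*float64(uint64(1)<<63)
}

func main() {
	sc, err := Parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(sc.TraceID, sc.SpanID, sc.Sampled, err)
	fmt.Println(ShouldSample(sc.TraceID, 0.01)) // 0.01 is the 1% production rate
}
```

Because the decision is taken once at the head of the trace, a trace is either exported in full or not at all, which is what keeps sampled traces complete across services.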

Alerts

Alerting rules in infra/observability/ (evaluated by Prometheus and routed through Alertmanager) should at minimum cover:

request error rate per service > threshold (5xx / total)
request p95 latency above SLO
event consumer lag (pending_entries from Redis Streams)
worker-pool stream length growth rate
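The first two conditions in the list reduce to simple arithmetic over the request metrics. The sketch below works through both in Go: a 5xx / total ratio, and a p95 estimate from cumulative histogram buckets using linear interpolation within the bucket, which is roughly how Prometheus's `histogram_quantile` estimates quantiles. All names, the bucket bounds, and the thresholds (1% error ratio, 0.5 s latency) are hypothetical; the real rules would be written as Prometheus expressions.

```go
package main

import "fmt"

// ErrorRatio is the share of requests that ended in a 5xx status.
func ErrorRatio(countByStatus map[int]float64) float64 {
	var total, failed float64
	for status, n := range countByStatus {
		total += n
		if status >= 500 && status <= 599 {
			failed += n
		}
	}
	if total == 0 {
		return 0 // no traffic, nothing to alert on
	}
	return failed / total
}

// Quantile estimates the q-quantile from cumulative histogram buckets.
// upper holds the finite bucket bounds in ascending order, cumulative[i]
// counts observations <= upper[i], and total counts all observations
// (the +Inf bucket). It returns 0 when there is no usable data.
func Quantile(q float64, upper, cumulative []float64, total float64) float64 {
	if total == 0 || len(upper) == 0 || len(upper) != len(cumulative) {
		return 0
	}
	rank := q * total
	for i, c := range cumulative {
		if c < rank {
			continue
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = upper[i-1], cumulative[i-1]
		}
		if c == below {
			return lower // empty bucket, avoid dividing by zero
		}
		// Interpolate linearly between the bucket's bounds.
		return lower + (upper[i]-lower)*(rank-below)/(c-below)
	}
	// The rank falls in the +Inf bucket: report the highest finite bound.
	return upper[len(upper)-1]
}

func main() {
	ratio := ErrorRatio(map[int]float64{200: 95, 404: 3, 500: 1, 503: 1})
	p95 := Quantile(0.95, []float64{0.1, 0.5, 1}, []float64{50, 90, 100}, 100)

	const maxErrorRatio, latencySLO = 0.01, 0.5 // hypothetical thresholds
	fmt.Println("error-rate alert:", ratio > maxErrorRatio)
	fmt.Println("latency alert:", p95 > latencySLO)
}
```

The interpolation is why bucket bounds matter: a p95 estimate can never be more precise than the width of the bucket it lands in.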

Required Per-Service Wiring

A new service must, at minimum:

Register ServiceMetrics and expose /metrics.
Use Logger.WithService(name).WithTraceID(...) for every request log.
Inject traceparent headers when calling other services and when publishing events (set Metadata["traceparent"] on the event envelope).

Skipping any of those breaks the cross-service trace stitching.
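The propagation step in the third requirement can be sketched as follows. This is an illustration under assumptions: the `Envelope` type and the `InjectHTTP`, `InjectEvent`, and `ExtractEvent` helpers are hypothetical, as is the example URL; only the `Metadata["traceparent"]` key comes from the text. In a real service the header value would be built from the current span rather than passed in as a string.

```go
package main

import (
	"fmt"
	"net/http"
)

const traceparentKey = "traceparent"

// Envelope is a hypothetical event envelope; only the Metadata map and its
// "traceparent" key are taken from the wiring requirements.
type Envelope struct {
	Type     string
	Metadata map[string]string
}

// InjectHTTP copies the active trace context onto an outgoing request.
func InjectHTTP(req *http.Request, traceparent string) {
	req.Header.Set(traceparentKey, traceparent)
}

// InjectEvent copies the active trace context into the envelope before publishing.
func InjectEvent(ev *Envelope, traceparent string) {
	if ev.Metadata == nil {
		ev.Metadata = make(map[string]string)
	}
	ev.Metadata[traceparentKey] = traceparent
}

// ExtractEvent returns the propagated context. ok is false when the
// publisher skipped injection, in which case the consumer cannot join
// the original trace.
func ExtractEvent(ev Envelope) (traceparent string, ok bool) {
	tp := ev.Metadata[traceparentKey] // reading a nil map is safe in Go
	return tp, tp != ""
}

func main() {
	active := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

	// Outgoing HTTP call (hypothetical URL, no request is actually sent).
	req, err := http.NewRequest(http.MethodGet, "http://knowledge-service.internal/answers", nil)
	if err != nil {
		panic(err)
	}
	InjectHTTP(req, active)
	fmt.Println("http header:", req.Header.Get(traceparentKey))

	// Event publish and consume.
	ev := Envelope{Type: "answer.accepted"}
	InjectEvent(&ev, active)
	tp, ok := ExtractEvent(ev)
	fmt.Println("event metadata:", tp, ok)
}
```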

Key Terms

Trace → an end-to-end DAG of spans, identified by trace_id.
Span → one unit of work within a trace; has service, endpoint, duration.
traceparent → W3C header carrying the parent span context across boundaries.
Sampling rate → percentage of traces actually exported (1% in prod).
Pending entries → Redis Streams’ un-ACKed backlog; the canonical “consumer lag” signal.

Q&A

Q: A request flows gateway → knowledge-service → event publish → gamification-service. The gamification log row has no trace_id. What broke? A: The event envelope’s Metadata["traceparent"] was not set at publish time, so the consumer had no parent context to continue; at best it started a fresh, unrelated trace. Fix the publisher to copy the active span context into Metadata before Publish.

Q: Why is there a separate errors_total when aggregated 5xx status counts already capture failures? A: errors_total labels by Go error category (network, validation, internal, …) for failure-mode analysis. HTTP status alone collapses too many root causes into 500.

Q: A new histogram for “DB query duration” — what bucket bounds? A: Match the framework’s existing request_duration_seconds buckets unless you have evidence the distribution differs. Cross-metric comparability is worth more than locally-tuned buckets.

Examples

Adding a custom counter for accepted answers:

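import "github.com/prometheus/client_golang/prometheus"

// acceptedAnswers counts answers marked as accepted.
var acceptedAnswers = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "knowledge_answer_accepted_total",
    Help: "Number of answers marked accepted",
})

// Register once at startup; MustRegister panics on a duplicate registration.
func init() { prometheus.MustRegister(acceptedAnswers) }

// in handler: acceptedAnswers.Inc()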
