CRUMB a card from devarno-cloud

Observability Stack

tektree intermediate 5 min read

ELI5

Observability is the building’s nervous system: every service writes a journal entry (logs), every request hits a counter on the wall (metrics), and a coloured thread is tied to each request as it walks through the building (traces). Grafana is the security room with monitors showing all three.

Technical Deep Dive

infra/observability/docker-compose.yml brings up Grafana + Prometheus + Alertmanager with the bundled dashboards and rules. The plan in docs/docs/architecture/OBSERVABILITY_PLAN.md defines what every service must emit.

Telemetry Stack

flowchart LR
subgraph App[each service]
Z[Zap structured logs]
P[Prometheus metrics endpoint]
O[OpenTelemetry tracer]
end
Z -->|JSON, ISO8601| LogStore[(log store / archive)]
P -->|/metrics scrape| Prom[(Prometheus)]
O -->|traceparent| Collector[OTel collector] --> Tracer[(trace backend)]
Prom --> Graf[Grafana]
Tracer --> Graf
Prom --> AM[Alertmanager]

Logging

Per OBSERVABILITY_PLAN.md:

  • Format: JSON
  • Required fields: timestamp, level, service, trace_id, message
  • Levels: DEBUG (dev only), INFO, WARN, ERROR, FATAL
  • Retention: INFO 7 d; WARN/ERROR/FATAL 30 d; archives 90 d

The Zap wrapper in libs/shared-go/pkg/logging enforces the field set with WithTraceID, WithUserID, WithService helpers.
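The field and retention rules above can be expressed as a small check. This is a minimal sketch using only the standard library, not the Zap wrapper itself; MissingFields and RetentionDays are hypothetical names introduced for illustration, and the 90-day archive tier is not modeled.

```go
package main

// requiredFields mirrors the field set listed in the plan.
var requiredFields = []string{"timestamp", "level", "service", "trace_id", "message"}

// MissingFields returns the required fields that are absent or empty
// in a decoded log entry, in the order the plan lists them.
func MissingFields(entry map[string]string) []string {
	var missing []string
	for _, f := range requiredFields {
		if entry[f] == "" {
			missing = append(missing, f)
		}
	}
	return missing
}

// RetentionDays returns the retention window for a log level.
// ok is false for levels with no stated window (DEBUG is dev only).
func RetentionDays(level string) (days int, ok bool) {
	switch level {
	case "INFO":
		return 7, true
	case "WARN", "ERROR", "FATAL":
		return 30, true
	default:
		return 0, false
	}
}
```

A check like this could run in a log-pipeline test to catch entries that would be impossible to join to a trace later.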

Metrics

Two metric families ship from libs/shared-go/pkg/metrics:

Metric                                Type          Labels
{service}_requests_total              CounterVec    method, endpoint, status
{service}_request_duration_seconds    HistogramVec  method, endpoint
{service}_requests_in_flight          Gauge         (none)
{service}_errors_total                CounterVec    type
{service}_events_published_total      CounterVec    event_type
{service}_events_consumed_total       CounterVec    event_type
{service}_event_latency_seconds       HistogramVec  event_type
{service}_event_errors_total          CounterVec    event_type, error_type
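The {service}_ prefix has to produce a legal Prometheus metric name, which may contain only letters, digits, underscores and colons and may not start with a digit. The sketch below shows one way to expand the pattern. MetricName is a hypothetical helper, and whether the shared library sanitizes names this way is an assumption.

```go
package main

import "strings"

// MetricName expands the {service}_<suffix> pattern.
// Any character outside [a-zA-Z0-9_] becomes '_', and so does a leading digit,
// so that the result is a legal metric name. Colons are legal too, but they are
// conventionally left to recording rules, so they are replaced here as well.
func MetricName(service, suffix string) string {
	var b strings.Builder
	for i, r := range service {
		isLetter := (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || r == '_'
		isDigit := r >= '0' && r <= '9'
		switch {
		case isLetter, isDigit && i > 0:
			b.WriteRune(r)
		default:
			b.WriteByte('_')
		}
	}
	b.WriteByte('_')
	b.WriteString(suffix)
	return b.String()
}
```

For example, MetricName("knowledge-service", "errors_total") yields knowledge_service_errors_total.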

Business metrics defined in the plan (additional, not framework-shipped):

Metric                             Labels
gamification_xp_earned_total       source
payment_subscriptions_total        tier, status
knowledge_content_created_total    type

Tracing

OpenTelemetry with W3C traceparent propagation across HTTP and event-bus boundaries. Production sampling is 1% head-based; the catalog calls this out, since anything higher is too much for the expected volumes. Each span is tagged with service, endpoint, and user_id (when known). Its trace_id matches the structured log field, so a click from logs to traces lines up.
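Head-based sampling means the keep-or-drop decision is made once, when the trace starts, and every downstream service honours it. One common way to make that decision reproducible is to derive it from the trace ID. The sketch below illustrates the idea with the standard library only. ShouldSample is a hypothetical function, not the project's sampler, and it only resembles the trace-ID-ratio approach found in tracing SDKs.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// ShouldSample makes a deterministic head-based decision from the trace ID,
// so any process that evaluates the same trace_id reaches the same verdict.
// ratio is the fraction of traces to keep, e.g. 0.01 for 1%.
func ShouldSample(traceIDHex string, ratio float64) (bool, error) {
	raw, err := hex.DecodeString(traceIDHex)
	if err != nil || len(raw) != 16 {
		return false, fmt.Errorf("trace id must be 32 hex characters: %q", traceIDHex)
	}
	if ratio <= 0 {
		return false, nil
	}
	if ratio >= 1 {
		return true, nil
	}
	// Treat the low 8 bytes as a number in [0, 2^63) and keep the trace
	// when it falls below ratio * 2^63.
	x := binary.BigEndian.Uint64(raw[8:16]) >> 1
	bound := uint64(ratio * (1 << 63))
	return x < bound, nil
}
```

Because the verdict depends only on the trace ID, a trace is either exported by every service it touches or by none, which keeps sampled traces complete.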

Alerts

Alert rules in infra/observability/ (evaluated by Prometheus, routed through Alertmanager) should at minimum cover:

  • request error rate per service > threshold (5xx / total)
  • request p95 latency above SLO
  • event consumer lag (pending_entries from Redis Streams)
  • worker-pool stream length growth rate
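Two of the signals above reduce to simple arithmetic over counter and gauge samples. The sketch below spells that arithmetic out. ErrorRatio and GrowthPerSecond are hypothetical names, and the actual thresholds are left to the rules files because the plan excerpt here does not state them.

```go
package main

// ErrorRatio returns 5xx / total for a window of requests.
// A window with no traffic yields 0 rather than NaN, so an idle
// service does not trip the alert.
func ErrorRatio(count5xx, total float64) float64 {
	if total <= 0 {
		return 0
	}
	return count5xx / total
}

// GrowthPerSecond returns how fast a stream length changed between two
// samples taken elapsedSeconds apart. A sustained positive value means
// producers are outpacing the worker pool.
func GrowthPerSecond(prev, curr, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 {
		return 0
	}
	return (curr - prev) / elapsedSeconds
}
```

For instance, 5 failures out of 100 requests gives a ratio of 0.05, and a stream that grows from 100 to 400 entries in 60 seconds is growing at 5 entries per second.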

Required Per-Service Wiring

A new service must, at minimum:

  1. Register ServiceMetrics and expose /metrics.
  2. Use Logger.WithService(name).WithTraceID(...) for every request log.
  3. Inject traceparent headers when calling other services and when publishing events (set Metadata["traceparent"] on the event envelope).

Skipping any of those breaks the cross-service trace stitching.
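Step 3 is the one most often missed on the event path, so here is a minimal sketch of it. It formats and parses version 00 of the W3C traceparent value (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags). Envelope is a hypothetical stand-in for the real event envelope, and InjectTraceparent and ExtractTraceID are illustrative names, not the shared library's API.

```go
package main

import (
	"fmt"
	"regexp"
)

// Envelope is a stand-in for the event envelope; only Metadata matters here.
type Envelope struct {
	Metadata map[string]string
}

// traceparentRe matches version 00 of the header: version-traceid-parentid-flags.
// All-zero IDs, which the spec also rejects, are not checked in this sketch.
var traceparentRe = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// InjectTraceparent copies the active span context into the envelope
// metadata. Call it before publishing.
func InjectTraceparent(env *Envelope, traceID, spanID string, sampled bool) {
	flags := "00"
	if sampled {
		flags = "01"
	}
	if env.Metadata == nil {
		env.Metadata = make(map[string]string)
	}
	env.Metadata["traceparent"] = fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

// ExtractTraceID returns the trace ID carried by the envelope, or an error
// when the consumer would otherwise have to start a fresh trace.
func ExtractTraceID(env Envelope) (string, error) {
	m := traceparentRe.FindStringSubmatch(env.Metadata["traceparent"])
	if m == nil {
		return "", fmt.Errorf("missing or malformed traceparent")
	}
	return m[1], nil
}
```

A consumer that gets an error from ExtractTraceID has lost the link to the originating request, so logging that case loudly makes the broken publisher easy to find.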

Key Terms

  • Trace → an end-to-end DAG of spans, identified by trace_id.
  • Span → one unit of work within a trace; has service, endpoint, duration.
  • traceparent → W3C header carrying the parent span context across boundaries.
  • Sampling rate → percentage of traces actually exported (1% in prod).
  • Pending entries → Redis Streams’ un-ACKed backlog; the canonical “consumer lag” signal.

Q&A

Q: A request flows gateway → knowledge-service → event publish → gamification-service. The gamification log row has no trace_id. What broke? A: The event envelope’s Metadata["traceparent"] was not set when published; the consumer rebuilt a fresh trace context. Fix the publisher to copy the active span context into Metadata before Publish.

Q: Why is there a separate errors_total when status-aggregated 5xx counts already capture failures? A: errors_total labels by Go error category (network, validation, internal, …) for failure-mode analysis. HTTP status alone collapses too many root causes into 500.

Q: A new histogram for “DB query duration” — what bucket bounds? A: Match the framework’s existing request_duration_seconds buckets unless you have evidence the distribution differs. Cross-metric comparability is worth more than locally-tuned buckets.
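Bucket bounds matter because a percentile is not read from a histogram, it is estimated from it. The sketch below uses linear interpolation inside the bucket that contains the target rank, which is the general approach Prometheus documents for histogram_quantile; it is an illustration, not a reimplementation. Bucket, WithInf and Quantile are hypothetical names.

```go
package main

import "math"

// Bucket is one cumulative bucket: Count observations were <= UpperBound.
type Bucket struct {
	UpperBound float64
	Count      float64
}

// WithInf appends the implicit +Inf bucket holding the total observation count.
func WithInf(finite []Bucket, total float64) []Bucket {
	out := append([]Bucket{}, finite...)
	return append(out, Bucket{UpperBound: math.Inf(1), Count: total})
}

// Quantile estimates the q-quantile (0 <= q <= 1) from cumulative buckets
// sorted by UpperBound whose last bucket is +Inf.
func Quantile(q float64, buckets []Bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.Count < rank {
			continue
		}
		if i == len(buckets)-1 {
			// The rank falls in the +Inf bucket: the highest finite
			// bound is the best answer available.
			return buckets[i-1].UpperBound
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = buckets[i-1].UpperBound, buckets[i-1].Count
		}
		if b.Count == below {
			return lower
		}
		// Assume observations are spread evenly inside the bucket.
		return lower + (b.UpperBound-lower)*(rank-below)/(b.Count-below)
	}
	return math.NaN()
}
```

The estimate can never be finer than the bucket it lands in, so two metrics with different bounds carry different interpolation error. Sharing bounds keeps their percentiles comparable.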

Examples

Adding a custom counter for accepted answers:

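import "github.com/prometheus/client_golang/prometheus"

// acceptedAnswers counts answers marked as accepted.
var acceptedAnswers = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "knowledge_answer_accepted_total",
	Help: "Number of answers marked accepted",
})

// Register once at startup; MustRegister panics on a duplicate name.
func init() { prometheus.MustRegister(acceptedAnswers) }

// In the handler, after the answer is accepted: acceptedAnswers.Inc()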
