Observability State & Gaps

traceo beginner 4 min read

ELI5

Traceo can tell you it’s alive (/health) and roughly how busy the engine is (/metrics), but it cannot yet tell you which user’s request caused which downstream call. The dashboards have outlines drawn but most of the lights aren’t wired up — this card is the wiring map of what’s lit and what’s dark.

Technical Deep Dive

Current State (per `ARCHITECTURE_OVERVIEW.md` §Observability Gaps)

| Pillar | What exists | What’s missing |
| --- | --- | --- |
| Logging | stdlib loggers per module | structured/JSON formatters |
| Metrics (Engine) | `/metrics` Prometheus stub: uptime gauge, jobs-by-status gauge | HTTP request metrics, service-level metrics |
| Metrics (MCP) | none | `/metrics` endpoint absent entirely |
| Tracing | none | correlation IDs, distributed tracing, context propagation |
| Health | MCP `/health`, Engine `/health` + `/ready` | dependency checks are basic (DB + rust_backend only) |
| Errors | 15-type custom exception hierarchy with codes | global error formatter / handler |

What `/metrics` Emits Today

traceo_engine_uptime_seconds <gauge>
traceo_jobs_total{status="pending|running|completed|failed"} <gauge>

That is the entire surface. There is no traceo_http_requests_total, no histogram, and no per-tool counter, so alerts have to be authored either against the logs (which are unstructured) or against whether /health responds at all.
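To make the gap concrete, here is a minimal sketch of what a request counter could look like once rendered in the Prometheus text exposition format. It is not Traceo code: the functions `record_request` and `render_http_metrics`, and the label set (`method`, `path`, `status`), are assumptions for illustration. The metric name is the one named above as missing. In practice a client library such as `prometheus_client` would normally do this work; the sketch uses only the standard library so the output format is visible.

```python
from collections import Counter

# Hypothetical in-process store; keys are (method, path, status) label tuples.
_http_requests = Counter()


def record_request(method: str, path: str, status: int) -> None:
    """Count one finished HTTP request under its label set."""
    _http_requests[(method, path, str(status))] += 1


def render_http_metrics() -> str:
    """Render the counter in Prometheus text exposition format.

    Label values are assumed to contain no quotes, backslashes, or
    newlines; a real exporter would escape them.
    """
    lines = [
        "# HELP traceo_http_requests_total Total HTTP requests handled.",
        "# TYPE traceo_http_requests_total counter",
    ]
    for (method, path, status), count in sorted(_http_requests.items()):
        lines.append(
            f'traceo_http_requests_total{{method="{method}",'
            f'path="{path}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"
```

A counter (rather than a gauge) is the usual choice here because request totals only ever increase, which lets Prometheus derive rates from them.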

Hooks Already Wired

| Component | Source |
| --- | --- |
| Sentry SDK | `traceo_mcp_server/sentry_integration.py`, `engine/src/engine/sentry_integration.py` |
| Audit logging | `services/audit.py` + `audit_logs` table (migration `004_audit_retention.sql`) |
| Per-module loggers | `traceo_mcp_server/logging_config.py` |

Sentry catches unhandled exceptions; audit logging is request-driven (writes one row per mutation) and is not a metrics replacement.

Mindmap of the Gap

mindmap
  root((Observability))
    Logging
      stdlib loggers OK
      no JSON formatter
      no correlation id
    Metrics
      Engine /metrics basic
      MCP /metrics missing
      no HTTP histograms
      no tool counters
    Tracing
      no W3C traceparent
      no span propagation
    Health
      MCP /health
      Engine /health + /ready
      shallow dependency probes
    Errors
      custom exceptions ok
      error codes ok
      no global formatter

Why It Matters for the Job Pipeline

Without correlation IDs, a CSV upload that fails after the engine webhook reaches the MCP server cannot be traced end-to-end from a single log query. Today the operator has to grep both services by job_id, which only exists after the engine accepts the upload, so anything logged before the job_id is assigned cannot be matched to the upload at all.

Key Terms

Correlation ID → a per-request identifier propagated across services so a single trace stitches MCP, engine, and webhook spans together.
Audit log → mutation-event row written to audit_logs; not a metric and not a trace.
Health vs readiness → /health says “process is alive”, /ready says “dependencies are reachable”.
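The health/readiness distinction above can be sketched as two small functions. This is an illustration, not the engine's actual handlers: the function names, the response bodies, and the idea of passing probes in as a dictionary are assumptions. The probe names used in the usage note (DB and rust_backend) are the two dependencies this card says are checked today.

```python
from typing import Callable, Dict, Tuple


def health() -> Tuple[int, dict]:
    """Liveness: succeeds as long as the process can run this function."""
    return 200, {"status": "alive"}


def ready(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Readiness: 200 only if every dependency probe succeeds, else 503."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises is treated as an unreachable dependency.
            results[name] = False
    code = 200 if all(results.values()) else 503
    status = "ready" if code == 200 else "not_ready"
    return code, {"status": status, "checks": results}
```

Called as `ready({"db": probe_db, "rust_backend": probe_rust})`, where both probe functions are hypothetical, the response reports each dependency separately, so a failing readiness check also says which dependency is down.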

Q&A

Q: Does the MCP server emit Prometheus metrics? A: No. Adding a /metrics endpoint to the MCP server is an open gap. Today only the engine exposes Prometheus.

Q: Are HTTP requests counted? A: No traceo_http_requests_* metric exists. The only request-shaped signal is the rate-limiter’s X-RateLimit-Remaining header, which is per-IP and not surfaced to Prometheus.

Q: Where would correlation-ID propagation slot in? A: In a request header (e.g. X-Correlation-ID or W3C traceparent) that both AuthMiddleware (engine) and the MCP routes/decorators read and then place in UserContext, so that every log line and Sentry event is tagged with the same ID.
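The propagation idea in the last answer can be sketched with the standard library alone. Everything named here is hypothetical: `correlation_id_var`, `bind_correlation_id`, and `CorrelationIdFilter` do not exist in Traceo, and a `ContextVar` stands in for the UserContext field, whose real shape this card does not describe. The sketch handles only the X-Correlation-ID option, not W3C traceparent.

```python
import contextvars
import logging
import uuid
from typing import Mapping

# Stand-in for a field on UserContext (assumption); "-" means "no request bound".
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")


def bind_correlation_id(headers: Mapping[str, str]) -> str:
    """Reuse the caller's X-Correlation-ID if present, otherwise mint one."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    cid = lowered.get("x-correlation-id") or uuid.uuid4().hex
    correlation_id_var.set(cid)
    return cid


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id_var.get()
        return True  # never drops a record, only annotates it
```

A middleware would call `bind_correlation_id(request_headers)` once per request and forward the returned value on outgoing calls; attaching the filter to a handler then makes the ID available to any formatter as `%(correlation_id)s`.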

Examples

A failed CSV ingest today produces three signals: a Sentry exception (engine), an audit row for the failed job, and unstructured stdlib log lines on stdout. None of the three carries a shared request-level identifier, so joining them means matching timestamps and job_id by hand. Structured JSON logs with a correlation_id field, with the same value attached to the Sentry event and the audit row, would let a single query retrieve all three.
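A JSON formatter of the kind this example calls for might look like the sketch below. It is built only on the standard `logging` module; the class name `JsonFormatter` and the output field names (`ts`, `level`, `logger`, `message`) are assumptions, not something taken from `logging_config.py`.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Optional fields: present only if the caller passed them via
            # `extra=` or something upstream set them on the record.
            "correlation_id": getattr(record, "correlation_id", None),
            "job_id": getattr(record, "job_id", None),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)
```

With this formatter set on a handler, a call such as `logger.error("ingest failed", extra={"correlation_id": cid, "job_id": job_id})` emits a line that a log backend can filter by `correlation_id` directly, instead of by timestamp proximity.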
