Observability State & Gaps
traceo beginner 4 min read
ELI5
Traceo can tell you it’s alive (/health) and roughly how busy the engine is (/metrics), but it cannot yet tell you which user’s request caused which downstream call. The dashboards have outlines drawn but most of the lights aren’t wired up — this card is the wiring map of what’s lit and what’s dark.
Technical Deep Dive
Current State (per ARCHITECTURE_OVERVIEW.md §Observability Gaps)
| Pillar | What exists | What’s missing |
|---|---|---|
| Logging | stdlib loggers per module | structured/JSON formatters |
| Metrics (Engine) | /metrics Prometheus stub: uptime gauge, jobs-by-status gauge | HTTP request metrics, service-level metrics |
| Metrics (MCP) | — | /metrics endpoint absent entirely |
| Tracing | — | correlation IDs, distributed tracing, context propagation |
| Health | MCP /health, Engine /health + /ready | dependency checks are basic (DB + rust_backend only) |
| Errors | 15-type custom exception hierarchy with codes | global error formatter / handler |
What /metrics Emits Today
traceo_engine_uptime_seconds <gauge>
traceo_jobs_total{status="pending|running|completed|failed"} <gauge>
That is the entire surface. There is no traceo_http_requests_total, no histograms, and no per-tool counters, so alerts have to be authored against logs (which are unstructured) or against the existence of /health itself.
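For orientation, the two series could be rendered in Prometheus text exposition format roughly as follows. This is a minimal stdlib sketch, not the engine's actual handler: `render_metrics` and its parameters are hypothetical, and only the metric names and status values come from the list above.

```python
from __future__ import annotations

import time

# Status label values taken from the metric list above.
JOB_STATUSES = ("pending", "running", "completed", "failed")


def render_metrics(
    started_at: float,
    jobs_by_status: dict[str, int],
    now: float | None = None,
) -> str:
    """Render both gauges as Prometheus text (hypothetical helper)."""
    now = time.time() if now is None else now
    lines = [
        "# TYPE traceo_engine_uptime_seconds gauge",
        f"traceo_engine_uptime_seconds {now - started_at:.1f}",
        "# TYPE traceo_jobs_total gauge",
    ]
    for status in JOB_STATUSES:
        # Statuses with no jobs are still emitted, with a value of 0.
        count = jobs_by_status.get(status, 0)
        lines.append(f'traceo_jobs_total{{status="{status}"}} {count}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics(time.time() - 42, {"pending": 3, "failed": 1}))
```

Emitting a zero for empty statuses keeps each series present between scrapes, which makes "no failed jobs" distinguishable from "metric missing" when writing alerts.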
Hooks Already Wired
| Component | Source |
|---|---|
| Sentry SDK | traceo_mcp_server/sentry_integration.py, engine/src/engine/sentry_integration.py |
| Audit logging | services/audit.py + audit_logs table (migration 004_audit_retention.sql) |
| Per-module loggers | traceo_mcp_server/logging_config.py |
Sentry catches unhandled exceptions; audit logging is request-driven (writes one row per mutation) and is not a metrics replacement.
Mindmap of the Gap
mindmap
  root((Observability))
    Logging
      stdlib loggers OK
      no JSON formatter
      no correlation id
    Metrics
      Engine /metrics basic
      MCP /metrics missing
      no HTTP histograms
      no tool counters
    Tracing
      no W3C traceparent
      no span propagation
    Health
      MCP /health
      Engine /health + /ready
      shallow dependency probes
    Errors
      custom exceptions ok
      error codes ok
      no global formatter
Why It Matters for the Job Pipeline
Without correlation IDs, a CSV upload that fails after the engine webhook reaches the MCP server cannot be traced end-to-end from a single log query. Today the operator has to grep both services by job_id, which only exists once the engine accepts the upload, so anything that happens during the upload before a job_id is assigned cannot be found that way.
Key Terms
- Correlation ID → a per-request identifier propagated across services so a single trace stitches MCP, engine, and webhook spans together.
- Audit log → mutation-event row written to audit_logs; not a metric and not a trace.
- Health vs readiness → /health says “process is alive”, /ready says “dependencies are reachable”.
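The health versus readiness split can be sketched as two functions. This is an illustrative sketch only: `health`, `ready`, the `Probe` type, and the response shapes are hypothetical, and the probe names `db` and `rust_backend` simply mirror the two dependencies named in the Current State table.

```python
from typing import Callable, Dict, Tuple

# Hypothetical probe type: returns True when the dependency answers.
Probe = Callable[[], bool]


def health() -> Tuple[int, dict]:
    """Liveness: succeeds whenever the process can run this function."""
    return 200, {"status": "alive"}


def ready(probes: Dict[str, Probe]) -> Tuple[int, dict]:
    """Readiness: succeeds only if every dependency probe passes."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises counts as an unreachable dependency.
            results[name] = False
    ok = all(results.values())
    body = {"status": "ready" if ok else "not_ready", "checks": results}
    return (200 if ok else 503), body


if __name__ == "__main__":
    print(health())
    print(ready({"db": lambda: True, "rust_backend": lambda: False}))
```

Returning the per-dependency results in the body is one way to make a shallow probe more useful: a 503 then says which dependency failed instead of only that something did.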
Q&A
Q: Does the MCP server emit Prometheus metrics?
A: No. Adding a /metrics endpoint to the MCP server is an open gap. Today only the engine exposes Prometheus metrics.
Q: Are HTTP requests counted?
A: No traceo_http_requests_* metric exists. The only request-shaped signal is the rate-limiter’s X-RateLimit-Remaining header, which is per-IP and not surfaced to Prometheus.
Q: Where would correlation-ID propagation slot in?
A: In a header (e.g. X-Correlation-ID or W3C traceparent) read by both AuthMiddleware (engine) and the MCP routes/decorators, then stored in UserContext so that every log line and Sentry event is tagged with it.
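The header handling described in the last answer might look like the sketch below. It is an assumption-heavy illustration: the shape of UserContext is not documented here, so a `contextvars.ContextVar` stands in for it, and `extract_correlation_id` and `CorrelationFilter` are hypothetical names. The traceparent pattern follows the W3C layout of version, trace-id, parent-id, and flags.

```python
from __future__ import annotations

import contextvars
import logging
import re
import uuid

# Stand-in for UserContext: holds the current request's ID ("-" if none).
correlation_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "correlation_id", default="-"
)

# W3C traceparent: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$")


def extract_correlation_id(headers: dict[str, str]) -> str:
    """Prefer X-Correlation-ID, then the traceparent trace-id, else mint one."""
    lowered = {key.lower(): value for key, value in headers.items()}
    explicit = lowered.get("x-correlation-id", "").strip()
    if explicit:
        return explicit
    match = _TRACEPARENT.match(lowered.get("traceparent", "").strip())
    # An all-zero trace-id is invalid per the W3C spec, so it is ignored.
    if match and set(match.group(1)) != {"0"}:
        return match.group(1)
    return uuid.uuid4().hex


class CorrelationFilter(logging.Filter):
    """Copy the context value onto each record for use by formatters."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True
```

A middleware would call `correlation_id.set(extract_correlation_id(request_headers))` at the start of a request and `reset` the returned token at the end. Attaching `CorrelationFilter` to a handler then makes `%(correlation_id)s` available in its format string.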
Examples
A failed CSV ingest today produces: a Sentry exception (engine), an audit row for the failed job, and unstructured stdlib log lines on stdout. None of the three carry a shared identifier — joining them requires correlating timestamps and job_id manually. Adding structured JSON logs with a correlation_id field would collapse the three signals into a single query.
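A structured formatter of the kind suggested above could be as small as the following sketch. It uses only the standard library; `JsonFormatter`, the field names, and the `engine.ingest` logger name are hypothetical and are not taken from logging_config.py.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line (hypothetical formatter)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Supplied via `extra=`; missing fields serialize as null.
            "correlation_id": getattr(record, "correlation_id", None),
            "job_id": getattr(record, "job_id", None),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("engine.ingest")  # hypothetical logger name
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    # job_id is None here: the failure happens before a job exists.
    log.error("csv ingest failed", extra={"correlation_id": "req-1", "job_id": None})
```

Keeping job_id as an explicit null rather than omitting it means a query on correlation_id also returns lines logged before the engine has assigned a job.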