CRUMB a card from devarno-cloud

Observability State & Gaps

traceo beginner 4 min read

ELI5

Traceo can tell you it’s alive (/health) and roughly how busy the engine is (/metrics), but it cannot yet tell you which user’s request caused which downstream call. The dashboards have outlines drawn but most of the lights aren’t wired up — this card is the wiring map of what’s lit and what’s dark.

Technical Deep Dive

Current State (per ARCHITECTURE_OVERVIEW.md §Observability Gaps)

Pillar | What exists | What’s missing
Logging | stdlib loggers per module | structured/JSON formatters
Metrics (Engine) | /metrics Prometheus stub: uptime gauge, jobs-by-status gauge | HTTP request metrics, service-level metrics
Metrics (MCP) | none | /metrics endpoint absent entirely
Tracing | none | correlation IDs, distributed tracing, context propagation
Health | MCP /health, Engine /health + /ready | dependency checks are basic (DB + rust_backend only)
Errors | 15-type custom exception hierarchy with codes | global error formatter / handler

What /metrics Emits Today

traceo_engine_uptime_seconds <gauge>
traceo_jobs_total{status="pending|running|completed|failed"} <gauge>

That is the entire surface. There is no traceo_http_requests_total counter, no histograms, and no per-tool counters, so alerts have to be authored against logs (which are unstructured) or against the availability of /health itself.
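To make that surface concrete, here is a minimal sketch of how those two gauges could be rendered in the Prometheus text exposition format. The function name render_metrics, its parameters, and the sample values are hypothetical and not taken from the Traceo codebase. It also assumes that the status="pending|running|completed|failed" notation above means one series per status value.

```python
from typing import Mapping

# Statuses listed in the card; one gauge series is emitted per status.
JOB_STATUSES = ("pending", "running", "completed", "failed")


def render_metrics(uptime_seconds: float, jobs_by_status: Mapping[str, int]) -> str:
    """Render the two gauges in Prometheus text exposition format (hypothetical helper)."""
    lines = [
        "# TYPE traceo_engine_uptime_seconds gauge",
        f"traceo_engine_uptime_seconds {uptime_seconds}",
        "# TYPE traceo_jobs_total gauge",
    ]
    for status in JOB_STATUSES:
        # Missing statuses are reported as 0 so every series is always present.
        count = jobs_by_status.get(status, 0)
        lines.append(f'traceo_jobs_total{{status="{status}"}} {count}')
    return "\n".join(lines) + "\n"


# Illustrative values only.
sample = render_metrics(12.5, {"pending": 2, "failed": 1})
print(sample)
```

Emitting a zero for absent statuses is a design choice in this sketch: it keeps alert expressions from having to handle missing series.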

Hooks Already Wired

Component | Source
Sentry SDK | traceo_mcp_server/sentry_integration.py, engine/src/engine/sentry_integration.py
Audit logging | services/audit.py + audit_logs table (migration 004_audit_retention.sql)
Per-module loggers | traceo_mcp_server/logging_config.py

Sentry catches unhandled exceptions; audit logging is request-driven (writes one row per mutation) and is not a metrics replacement.

Mindmap of the Gap

mindmap
  root((Observability))
    Logging
      stdlib loggers OK
      no JSON formatter
      no correlation id
    Metrics
      Engine /metrics basic
      MCP /metrics missing
      no HTTP histograms
      no tool counters
    Tracing
      no W3C traceparent
      no span propagation
    Health
      MCP /health
      Engine /health + /ready
      shallow dependency probes
    Errors
      custom exceptions ok
      error codes ok
      no global formatter

Why It Matters for the Job Pipeline

Without correlation IDs, a CSV upload that fails after the engine webhook reaches the MCP server cannot be traced end-to-end from a single log query. Today the operator has to grep both services by job_id (which only exists after the engine accepts the upload), missing the pre-job_id window of the upload itself.

Key Terms

  • Correlation ID → a per-request identifier propagated across services so a single trace stitches MCP, engine, and webhook spans together.
  • Audit log → mutation-event row written to audit_logs; not a metric and not a trace.
  • Health vs readiness → /health says “process is alive”, /ready says “dependencies are reachable”.
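The health/readiness distinction can be sketched as two handlers: one that never touches dependencies and one that runs a probe per dependency. This is an illustrative sketch only. The function names, the response shapes, and the lambda probes are assumptions, not the engine's actual handlers. The dependency names db and rust_backend are taken from the table above.

```python
from typing import Callable, Dict, Tuple

# A probe returns True when its dependency is reachable (hypothetical type).
Probe = Callable[[], bool]


def health() -> Tuple[int, dict]:
    """Liveness: answers as long as the process can serve a request."""
    return 200, {"status": "alive"}


def ready(probes: Dict[str, Probe]) -> Tuple[int, dict]:
    """Readiness: 200 only if every dependency probe succeeds, else 503."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises counts as an unreachable dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, {"ready": status == 200, "dependencies": results}


# Stand-in probes; real ones would ping the database and the Rust backend.
status, body = ready({"db": lambda: True, "rust_backend": lambda: False})
print(status, body)
```

Reporting each dependency by name in the readiness body is what would turn a shallow probe into something an operator can act on.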

Q&A

Q: Does the MCP server emit Prometheus metrics? A: No. Today only the engine exposes Prometheus metrics; adding a /metrics endpoint to the MCP server is an open gap.

Q: Are HTTP requests counted? A: No traceo_http_requests_* metric exists. The only request-shaped signal is the rate-limiter’s X-RateLimit-Remaining header, which is per-IP and not surfaced to Prometheus.

Q: Where would correlation-ID propagation slot in? A: In a header (e.g. X-Correlation-ID or W3C traceparent) read by both AuthMiddleware (engine) and the MCP routes/decorators, then placed in UserContext so that every log line and Sentry event is tagged with it.
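One way the propagation described in that answer might look, using only the standard library: read the header once at the edge, keep it in a context variable, and let a logging filter stamp it on every record. The names bind_correlation_id, correlation_id_var, and CorrelationIdFilter are hypothetical. How this would attach to AuthMiddleware or UserContext is not shown, because their interfaces are not described here.

```python
import contextvars
import logging
import uuid
from typing import Mapping

# Holds the ID for the current request; "-" marks "no request in scope".
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")


def bind_correlation_id(headers: Mapping[str, str]) -> str:
    """Reuse the caller's X-Correlation-ID if present, otherwise mint one."""
    lowered = {key.lower(): value for key, value in headers.items()}
    cid = lowered.get("x-correlation-id") or uuid.uuid4().hex
    correlation_id_var.set(cid)
    return cid


class CorrelationIdFilter(logging.Filter):
    """Copies the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id_var.get()
        return True


# Usage: bind at the start of a request, then log as usual.
handler = logging.StreamHandler()
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
logger = logging.getLogger("traceo.sketch")
logger.addHandler(handler)

bind_correlation_id({"X-Correlation-ID": "abc123"})
logger.warning("upload received")
```

Minting an ID when the header is absent means the upload is traceable before any job_id exists, which is the window the pipeline currently misses.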

Examples

A failed CSV ingest today produces: a Sentry exception (engine), an audit row for the failed job, and unstructured stdlib log lines on stdout. None of the three carry a shared identifier — joining them requires correlating timestamps and job_id manually. Adding structured JSON logs with a correlation_id field would collapse the three signals into a single query.
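As a rough illustration of the structured log line that paragraph has in mind, the sketch below subclasses the standard logging.Formatter to emit one JSON object per record. The field names (ts, level, logger, message, correlation_id, job_id), the class name JsonFormatter, and the sample record are assumptions chosen for the example, not an existing Traceo schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each record as a single JSON object (hypothetical schema)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Present only if the caller supplied them, e.g. via `extra=`.
            "correlation_id": getattr(record, "correlation_id", None),
            "job_id": getattr(record, "job_id", None),
        }
        return json.dumps(payload)


# Build a record by hand to show the output shape; values are illustrative.
record = logging.LogRecord(
    "engine.ingest", logging.ERROR, "ingest.py", 1,
    "CSV ingest failed: %s", ("bad header",), None,
)
record.correlation_id = "abc123"
line = JsonFormatter().format(record)
print(line)
```

In normal use the fields would arrive through the standard extra argument, for example logger.error("CSV ingest failed", extra={"correlation_id": cid, "job_id": job_id}), so a log query on correlation_id returns every line for one request.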
