OpenTelemetry Instrumentation & Metrics
ELI5
OpenTelemetry is like a fitness tracker for software. It counts how many requests came in (like steps), measures how long each took (like heart rate), and draws a map of where each request went (like GPS tracking). All this data is sent to dashboards where you can see if your app is healthy or struggling.
Technical Deep Dive
OTel Architecture in iris-service
flowchart LR
    A["FastAPI App"] -->|Auto-instrumented| B["OTel SDK"]
    B --> C["TracerProvider"]
    B --> D["MeterProvider"]
    C -->|Spans| E["BatchSpanProcessor"]
    D -->|Metrics| F["PeriodicExportingMetricReader"]
    E -->|OTLP gRPC| G["OTel Collector"]
    F -->|OTLP gRPC| G
    G -->|Traces| H["Jaeger"]
    G -->|Metrics| I["Prometheus"]

Configuration (Environment Variables)
| Variable | Default | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP gRPC endpoint for the collector |
| OTEL_SERVICE_NAME | iris-service | Service name in traces and metrics |
| OTEL_SAMPLE_RATE | 0.1 | Trace sampling ratio (10% = 1 in 10 requests) |
| OTEL_ENABLED | true | Master switch to disable all telemetry |
| SERVICE_VERSION | 1.0.0 | Version tag on all telemetry |
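As a minimal sketch of how these variables could be read with their documented defaults, consider the loader below. The `TelemetryConfig` class and `load_telemetry_config` function are hypothetical names used for illustration; the actual iris-service settings code may be structured differently, and the clamping and boolean parsing rules are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryConfig:
    """Hypothetical settings container; the real iris-service class may differ."""
    endpoint: str
    service_name: str
    sample_rate: float
    enabled: bool
    service_version: str


def load_telemetry_config(env=None) -> TelemetryConfig:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    # Assumption: clamp the ratio into [0.0, 1.0] so a typo cannot yield an invalid rate.
    rate = min(1.0, max(0.0, float(env.get("OTEL_SAMPLE_RATE", "0.1"))))
    # Assumption: only a case-insensitive "true" enables telemetry.
    enabled = env.get("OTEL_ENABLED", "true").strip().lower() == "true"
    return TelemetryConfig(
        endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
        service_name=env.get("OTEL_SERVICE_NAME", "iris-service"),
        sample_rate=rate,
        enabled=enabled,
        service_version=env.get("SERVICE_VERSION", "1.0.0"),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests without mutating the process environment.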
Resource Attributes
All telemetry is tagged with:
- service.name=iris-service
- service.version=1.0.0
- service.namespace=iris
Trace Instrumentation
sequenceDiagram
    participant Client
    participant FastAPI as FastAPI (auto-instrumented)
    participant Router as API Router
    participant Registry as SpriteRegistry
    participant Engine as FingerprintEngine
    Client->>FastAPI: POST /v1/sprites
    FastAPI->>FastAPI: Create span: http.request
    FastAPI->>Router: Route to create_sprite
    Router->>Router: Create span: sprite_operation
    Router->>Registry: create()
    Registry->>Engine: compute_fingerprint()
    Engine-->>Registry: hash
    Registry-->>Router: Sprite
    Router-->>FastAPI: 201
    FastAPI-->>Client: Response
    Note over FastAPI: Span exported to Jaeger<br/>via BatchSpanProcessor

FastAPI is auto-instrumented via FastAPIInstrumentor.instrument_app(app). Every HTTP request becomes a trace span with attributes for method, path, status code, and duration.
Custom Metrics
classDiagram
    class MetricsRecorder {
        +record_sprite_operation(op, name, status, duration, error)
        +record_council_operation(op, domain, status, duration)
        +record_chain_execution(status, duration, steps_count)
        +record_gate_decision(gate_name, decision, duration)
    }
    class OperationTimer {
        +duration_ms
        +error
        +__enter__()
        +__exit__()
    }
    MetricsRecorder --> OperationTimer : uses

Counter metrics:
- sprite_operations_total: by operation, status
- council_operations_total: by operation, status
- chain_executions_total: by status
- chain_steps_executed_total: by status
- gate_decisions_total: by gate_name, decision

Histogram metrics:
- sprite_operation_duration_seconds: by operation, status
- council_operation_duration_seconds: by operation, status
- chain_execution_duration_seconds: by status
- gate_evaluation_duration_seconds: by gate_name, decision
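To illustrate how a timer and a recorder of this shape might work together, here is a stdlib-only sketch. `OperationTimer` follows the attributes in the class diagram, but its body is an assumption, and `InMemoryRecorder` is a hypothetical stand-in that stores values in dictionaries rather than in OpenTelemetry counter and histogram instruments.

```python
import time
from collections import defaultdict


class OperationTimer:
    """Context manager exposing duration_ms and error, as in the class diagram."""

    def __init__(self):
        self.duration_ms = 0.0
        self.error = None
        self._start = 0.0

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.duration_ms = (time.perf_counter() - self._start) * 1000.0
        # Keep the exception class name; returning False lets the exception propagate.
        self.error = exc_type.__name__ if exc_type else None
        return False


class InMemoryRecorder:
    """Hypothetical stand-in for MetricsRecorder backed by plain dictionaries."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.durations = defaultdict(list)

    def record_sprite_operation(self, op, name, status, duration, error=None):
        labels = (op, status)  # the documented label set: operation, status
        self.counters[("sprite_operations_total",) + labels] += 1
        self.durations[("sprite_operation_duration_seconds",) + labels].append(duration)


recorder = InMemoryRecorder()
with OperationTimer() as timer:
    sum(range(1000))  # placeholder for the real sprite operation
# The timer reports milliseconds; the histogram name implies seconds, so convert.
recorder.record_sprite_operation(
    "create", "SOL-FORGE", "success", timer.duration_ms / 1000.0, timer.error
)
```

Note that the sprite name is accepted but not used as a label here, which matches the documented label sets and keeps metric cardinality low.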
Trace Attributes
| Attribute | Context | Example |
|---|---|---|
| sprite.name | Sprite operations | SOL-FORGE |
| sprite.version | Sprite operations | 1.0.0 |
| operation | All operations | create, update, execute |
| status | All operations | success, error |
| error | Error spans | ValidationError |
| duration_ms | Timing spans | 45 |
| gate.name | Gate evaluation | scope_check |
| gate.decision | Gate evaluation | approved, denied |
| chain.id | Chain execution | uuid |
| step.name | Step execution | generate_code |
Sampling Strategy
flowchart LR
    A["Incoming Request"] --> B["TraceIdRatioBased<br/>Sampler"]
    B -->|Sample rate: 0.1| C["10% sampled"]
    B -->|90% rejected| D["Not traced"]
    C --> E["BatchSpanProcessor"]
    E --> F["OTLP Exporter"]

- Probabilistic sampling: TraceIdRatioBased at the configured sample rate
- 10% default: 1 in 10 requests generates a full trace
- Batched export: Spans are buffered and sent in batches to reduce overhead
- 5-second flush: On shutdown, all buffered spans are force-flushed with a 5-second timeout
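The idea behind ratio-based sampling can be sketched in a few lines. This is a simplified illustration, not the SDK's source: it assumes the decision compares the low 64 bits of the trace ID against a bound derived from the rate, and `should_sample` is a hypothetical helper name.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # mask selecting the low 64 bits of a 128-bit trace ID


def should_sample(trace_id: int, rate: float) -> bool:
    """Keep the trace when the low 64 bits of its ID fall below rate * 2**64."""
    bound = round(rate * (1 << 64))
    return (trace_id & TRACE_ID_LIMIT) < bound


rng = random.Random(42)  # fixed seed so the demo is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
sampled_fraction = sum(should_sample(t, 0.1) for t in trace_ids) / len(trace_ids)
```

Because the decision depends only on the trace ID, any component applying the same rate to the same trace ID reaches the same verdict, and the sampled share converges on the configured ratio rather than being exactly 1 in every 10 requests.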
Lifecycle Management
# Startup (lifespan)
initialize_telemetry()  # Creates TracerProvider + MeterProvider
instrument_app(app)     # Auto-instruments FastAPI

# Shutdown (lifespan)
shutdown_telemetry()    # Force-flushes with 5s timeout

Both operations are idempotent. Telemetry can be completely disabled via OTEL_ENABLED=false.
Key Terms
- Trace → A directed acyclic graph of spans representing a single request’s path through the system
- Span → A single operation within a trace (e.g., “HTTP POST /v1/sprites”, “compute fingerprint”)
- Metric → A numeric measurement aggregated over time (counters, histograms, gauges)
- OTLP → OpenTelemetry Protocol for exporting telemetry; IRIS uses the gRPC transport (OTLP also supports HTTP)
- Sampler → Decides which traces to collect; IRIS uses probabilistic ratio-based sampling
- BatchSpanProcessor → Buffers spans and exports them in batches for efficiency
- Resource → Metadata attached to all telemetry from a service (name, version, namespace)
Q&A
Q: How do I increase trace sampling for debugging?
A: Set OTEL_SAMPLE_RATE=1.0 to capture 100% of requests. Remember to reduce this in production to avoid overhead.
Q: Can I add custom spans to my code?
A: Yes. Use get_tracer(name).start_as_current_span("my_operation") as a context manager. The tracer is configured globally after initialize_telemetry().
Q: What happens if the OTel Collector is down?
A: The OTLP exporter has retry logic with exponential backoff. If the collector remains unavailable, spans are dropped rather than buffered indefinitely, which prevents unbounded memory growth.
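The retry-then-drop behaviour can be sketched as follows. This illustrates the general pattern only; `export_with_backoff`, the attempt count, and the base delay are hypothetical and are not the OTLP exporter's actual parameters.

```python
import time


def export_with_backoff(send, batch, max_attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Try to send a batch, waiting base_delay_s * 2**attempt between failures.

    Returns True on success and False when the batch is given up on (dropped).
    """
    for attempt in range(max_attempts):
        try:
            send(batch)
            return True
        except ConnectionError:
            if attempt < max_attempts - 1:
                sleep(base_delay_s * (2 ** attempt))
    return False  # the caller discards the batch instead of keeping it in memory


def collector_down(batch):
    """Simulated transport that always fails."""
    raise ConnectionError("collector unreachable")


delays = []  # record the requested waits instead of actually sleeping
dropped = not export_with_backoff(collector_down, ["span-1"], sleep=delays.append)
```

Injecting `sleep` makes the backoff schedule observable without slowing the example down; here the recorded waits double after each failed attempt before the batch is dropped.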
Q: How do I correlate logs with traces?
A: The request context middleware injects request_id (from X-Request-ID or X-Correlation-ID headers) into request.state. Include this ID in log entries for correlation.
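One way to get the ID into every log entry is a logging filter, sketched here with the standard library. This assumes the request ID is also made available through a `contextvars.ContextVar`; the names `request_id_var`, `RequestIdFilter`, and the `iris.demo` logger are hypothetical, and the real middleware stores the value on request.state.

```python
import contextvars
import io
import logging

# Holds the ID for the request currently being handled; "-" means no request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


stream = io.StringIO()  # captured in memory so the demo output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
)
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("iris.demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# In the service, middleware would set this from X-Request-ID or X-Correlation-ID.
token = request_id_var.set("req-123")
logger.info("sprite created")
request_id_var.reset(token)

log_line = stream.getvalue().strip()
```

Searching logs for the same ID that the client sent in its header then ties log entries back to the request being investigated.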
Q: Are metrics persisted between restarts?
A: No. Prometheus scrapes metrics from the running process. Historical metrics are stored in Prometheus’s time-series database (15-day retention in the Docker stack).
Examples
OpenTelemetry is like a city’s traffic management system:
- Traces = Individual car GPS tracks showing exactly which streets they took, how long at each intersection, and where they stopped
- Metrics = Traffic counters on each road (“45 cars per minute on Main Street”) and average travel times
- Sampling = Only tracking 10% of cars to save money on GPS units, but still getting accurate traffic patterns
- Jaeger = The traffic control centre where operators watch live GPS tracks
- Prometheus = The city’s database of historical traffic volumes
- Grafana = The big screens in city hall showing “Current congestion: moderate”