OpenTelemetry Instrumentation & Metrics


ELI5

OpenTelemetry is like a fitness tracker for software. It counts how many requests came in (like steps), measures how long each took (like heart rate), and draws a map of where each request went (like GPS tracking). All this data is sent to dashboards where you can see if your app is healthy or struggling.

Technical Deep Dive

OTel Architecture in iris-service

flowchart LR
    A["FastAPI App"] -->|Auto-instrumented| B["OTel SDK"]
    B --> C["TracerProvider"]
    B --> D["MeterProvider"]
    C -->|Spans| E["BatchSpanProcessor"]
    D -->|Metrics| F["PeriodicExportingMetricReader"]
    E -->|OTLP gRPC| G["OTel Collector"]
    F -->|OTLP gRPC| G
    G -->|Traces| H["Jaeger"]
    G -->|Metrics| I["Prometheus"]

Configuration (Environment Variables)

| Variable | Default | Description |
| --- | --- | --- |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://localhost:4317` | OTLP gRPC endpoint for the collector |
| `OTEL_SERVICE_NAME` | `iris-service` | Service name in traces and metrics |
| `OTEL_SAMPLE_RATE` | `0.1` | Trace sampling ratio (10% = 1 in 10 requests) |
| `OTEL_ENABLED` | `true` | Master switch to disable all telemetry |
| `SERVICE_VERSION` | `1.0.0` | Version tag on all telemetry |

Resource Attributes

All telemetry is tagged with:

service.name = iris-service
service.version = 1.0.0
service.namespace = iris
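
The variables and resource attributes above could be read into one settings object at startup. The following is a minimal sketch, not the iris-service implementation: the `TelemetrySettings` class, its field names, and the accepted boolean spellings are assumptions made for illustration. Only the variable names, defaults, and attribute keys come from this page.

```python
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as false."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class TelemetrySettings:  # hypothetical name, not taken from iris-service
    endpoint: str
    service_name: str
    sample_rate: float
    enabled: bool
    service_version: str

    @classmethod
    def from_env(cls, env=None) -> "TelemetrySettings":
        """Build settings from a mapping (defaults to os.environ)."""
        env = os.environ if env is None else env
        rate = float(env.get("OTEL_SAMPLE_RATE", "0.1"))
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"OTEL_SAMPLE_RATE must be in [0, 1], got {rate}")
        return cls(
            endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
            service_name=env.get("OTEL_SERVICE_NAME", "iris-service"),
            sample_rate=rate,
            enabled=_as_bool(env.get("OTEL_ENABLED", "true")),
            service_version=env.get("SERVICE_VERSION", "1.0.0"),
        )

    def resource_attributes(self) -> dict:
        """Attributes attached to every span and metric from this service."""
        return {
            "service.name": self.service_name,
            "service.version": self.service_version,
            "service.namespace": "iris",
        }
```

Validating the sample rate at load time means a typo such as `OTEL_SAMPLE_RATE=10` fails at startup instead of silently changing how much is traced.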

Trace Instrumentation

sequenceDiagram
    participant Client
    participant FastAPI as FastAPI (auto-instrumented)
    participant Router as API Router
    participant Registry as SpriteRegistry
    participant Engine as FingerprintEngine

    Client->>FastAPI: POST /v1/sprites
    FastAPI->>FastAPI: Create span: http.request
    FastAPI->>Router: Route to create_sprite
    Router->>Router: Create span: sprite_operation
    Router->>Registry: create()
    Registry->>Engine: compute_fingerprint()
    Engine-->>Registry: hash
    Registry-->>Router: Sprite
    Router-->>FastAPI: 201
    FastAPI-->>Client: Response
    Note over FastAPI: Spans batched by BatchSpanProcessor,<br/>sent to the OTel Collector, then Jaeger

FastAPI is auto-instrumented via FastAPIInstrumentor.instrument_app(app). Every HTTP request gets a server span with attributes for method, path, and status code, plus its duration; only sampled spans are exported.

Custom Metrics

classDiagram
    class MetricsRecorder {
        +record_sprite_operation(op, name, status, duration, error)
        +record_council_operation(op, domain, status, duration)
        +record_chain_execution(status, duration, steps_count)
        +record_gate_decision(gate_name, decision, duration)
    }
    class OperationTimer {
        +duration_ms
        +error
        +__enter__()
        +__exit__()
    }
    MetricsRecorder --> OperationTimer : uses
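
The diagram lists only the public surface of `OperationTimer`. A context manager with that surface might look like the sketch below. This is an assumption-laden illustration: the use of `time.perf_counter` and the choice to store the exception class name in `error` are guesses based on the `duration_ms` and `error` attributes shown elsewhere on this page, not confirmed behaviour.

```python
import time


class OperationTimer:
    """Context manager that records elapsed time and any raised error."""

    def __init__(self):
        self.duration_ms = 0.0
        self.error = None  # stays None when the block succeeds
        self._start = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Measure even when the block raised, so failures are timed too.
        self.duration_ms = (time.perf_counter() - self._start) * 1000.0
        if exc_type is not None:
            self.error = exc_type.__name__  # assumption: class name, e.g. "ValidationError"
        return False  # never swallow the exception
```

Wrapping a call in `with OperationTimer() as timer:` leaves `timer.duration_ms` and `timer.error` available after the block, ready to pass to one of the `record_*` methods.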

Counter metrics:

sprite_operations_total — by operation, status
council_operations_total — by operation, status
chain_executions_total — by status
chain_steps_executed_total — by status
gate_decisions_total — by gate_name, decision

Histogram metrics:

sprite_operation_duration_seconds — by operation, status
council_operation_duration_seconds — by operation, status
chain_execution_duration_seconds — by status
gate_evaluation_duration_seconds — by gate_name, decision
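
To illustrate how the label sets above turn into separate time series, here is a small in-memory stand-in for `MetricsRecorder`. It is a sketch only: the `InMemoryRecorder` class is hypothetical, it does not use the OpenTelemetry metrics API, and it assumes `duration` is already expressed in seconds. The method signatures and metric names are taken from this page.

```python
from collections import Counter, defaultdict


class InMemoryRecorder:  # hypothetical stand-in for MetricsRecorder
    def __init__(self):
        self.counters = Counter()            # (metric name, labels) -> count
        self.histograms = defaultdict(list)  # (metric name, labels) -> observations

    def record_sprite_operation(self, op, name, status, duration, error=None):
        # Only operation and status are labels, matching the list above;
        # name and error are accepted but not used as metric dimensions here.
        labels = (("operation", op), ("status", status))
        self.counters[("sprite_operations_total", labels)] += 1
        self.histograms[("sprite_operation_duration_seconds", labels)].append(duration)

    def record_gate_decision(self, gate_name, decision, duration):
        labels = (("gate_name", gate_name), ("decision", decision))
        self.counters[("gate_decisions_total", labels)] += 1
        self.histograms[("gate_evaluation_duration_seconds", labels)].append(duration)
```

Each distinct label combination becomes its own series, which is one reason to keep high-cardinality values such as sprite names on spans rather than on metrics.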

Trace Attributes

| Attribute | Context | Example |
| --- | --- | --- |
| `sprite.name` | Sprite operations | `SOL-FORGE` |
| `sprite.version` | Sprite operations | `1.0.0` |
| `operation` | All operations | `create`, `update`, `execute` |
| `status` | All operations | `success`, `error` |
| `error` | Error spans | `ValidationError` |
| `duration_ms` | Timing spans | `45` |
| `gate.name` | Gate evaluation | `scope_check` |
| `gate.decision` | Gate evaluation | `approved`, `denied` |
| `chain.id` | Chain execution | `uuid` |
| `step.name` | Step execution | `generate_code` |

Sampling Strategy

flowchart LR
    A["Incoming Request"] --> B["TraceIdRatioBased<br/>Sampler"]
    B -->|Sample rate: 0.1| C["10% sampled"]
    B -->|90% rejected| D["Not traced"]
    C --> E["BatchSpanProcessor"]
    E --> F["OTLP Exporter"]

Probabilistic sampling: TraceIdRatioBased at the configured sample rate
10% default: about 1 in 10 requests generates a full trace
Batched export: Spans are buffered and sent in batches to reduce overhead
5-second flush: On shutdown, all buffered spans are force-flushed with a 5-second timeout
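
The ratio-based decision can be sketched in a few lines. This is a simplified illustration of the idea behind `TraceIdRatioBased`, written without the SDK: the decision is derived from the trace ID itself, so every service that sees the same trace ID reaches the same verdict. The function name `should_sample` is chosen for this example, and the SDK's real implementation may differ in detail.

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # only the low 64 bits of the trace ID are compared


def should_sample(trace_id: int, rate: float) -> bool:
    """Return True if a trace with this ID falls inside the sampled fraction."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0.0, 1.0]")
    # Map the rate onto the 64-bit ID space: rate 0.1 keeps the lowest 10% of IDs.
    bound = round(rate * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound
```

Because trace IDs are effectively random, comparing them against a fixed bound yields the configured fraction on average without any shared state or coordination between processes.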

Lifecycle Management

# Startup (lifespan)
initialize_telemetry()  # Creates TracerProvider + MeterProvider
instrument_app(app)     # Auto-instruments FastAPI

# Shutdown (lifespan)
shutdown_telemetry()    # Force-flushes with 5s timeout

Startup and shutdown are both idempotent: calling either one twice has no further effect. Set OTEL_ENABLED=false to disable telemetry completely.

Key Terms

Trace → A directed acyclic graph of spans representing a single request’s path through the system
Span → A single operation within a trace (e.g., “HTTP POST /v1/sprites”, “compute fingerprint”)
Metric → A numeric measurement aggregated over time (counters, histograms, gauges)
OTLP → OpenTelemetry Protocol; the wire protocol for exporting telemetry, available over gRPC or HTTP (iris-service uses gRPC)
Sampler → Decides which traces to collect; IRIS uses probabilistic ratio-based sampling
BatchSpanProcessor → Buffers spans and exports them in batches for efficiency
Resource → Metadata attached to all telemetry from a service (name, version, namespace)

Q&A

Q: How do I increase trace sampling for debugging? A: Set OTEL_SAMPLE_RATE=1.0 to capture 100% of requests. Remember to reduce this in production to avoid overhead.

Q: Can I add custom spans to my code? A: Yes. Use get_tracer(name).start_as_current_span("my_operation") as a context manager. The tracer is configured globally after initialize_telemetry().

Q: What happens if the OTel Collector is down? A: The OTLP exporter retries with exponential backoff. If the collector remains unavailable, spans are dropped rather than buffered indefinitely, which prevents unbounded memory growth.
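
The retry-then-drop behaviour can be pictured with a small sketch. This is not the OTLP exporter's code: the function `export_with_retry`, the attempt count, and the one-second base delay are illustrative assumptions, since this page does not state the exporter's actual retry parameters.

```python
import time


def export_with_retry(send, batch, max_attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Try to send a batch; back off exponentially, then give up and drop it.

    Returns True if the batch was delivered, False if it was dropped.
    """
    for attempt in range(max_attempts):
        try:
            send(batch)
            return True
        except ConnectionError:
            if attempt == max_attempts - 1:
                break  # out of attempts: drop instead of buffering forever
            sleep(base_delay_s * (2 ** attempt))  # delay doubles after each failure
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting, and returning a boolean lets the caller count dropped batches.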

Q: How do I correlate logs with traces? A: The request context middleware injects request_id (from X-Request-ID or X-Correlation-ID headers) into request.state. Include this ID in log entries for correlation.
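
A common way to get the ID into every log line is a context variable plus a logging filter. The sketch below uses only the standard library and rests on several assumptions: the names `request_id_var`, `resolve_request_id`, `RequestIdFilter`, and `build_logger` are hypothetical, as are the header precedence and the fallback to a generated ID. In iris-service the value would come from request.state instead.

```python
import contextvars
import logging
import uuid

# Holds the ID for the request currently being handled ("-" outside a request).
request_id_var = contextvars.ContextVar("request_id", default="-")


def resolve_request_id(headers: dict) -> str:
    """Pick the ID from known headers, generating one if neither is present."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return (
        lowered.get("x-request-id")
        or lowered.get("x-correlation-id")
        or uuid.uuid4().hex  # assumption: fall back to a fresh ID
    )


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s [%(request_id)s] %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger = logging.getLogger("iris.example")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With this in place, application code logs as usual and the ID appears in every line, so a log search for one request ID returns the entries that belong to the same request.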

Q: Are metrics persisted between restarts? A: Not by the service itself. Metric state lives in process memory, so counters start again from zero after a restart. Data that has already reached Prometheus through the OTel Collector stays in its time-series database (15-day retention in the Docker stack).

Examples

OpenTelemetry is like a city’s traffic management system:

Traces = Individual car GPS tracks showing exactly which streets they took, how long at each intersection, and where they stopped
Metrics = Traffic counters on each road (“45 cars per minute on Main Street”) and average travel times
Sampling = Only tracking 10% of cars to save money on GPS units, but still getting accurate traffic patterns
Jaeger = The traffic control centre where operators watch live GPS tracks
Prometheus = The city’s database of historical traffic volumes
Grafana = The big screens in city hall showing “Current congestion: moderate”
