CRUMB a card from devarno-cloud

OpenTelemetry Instrumentation & Metrics

iris intermediate 5 min read

ELI5

OpenTelemetry is like a fitness tracker for software. It counts how many requests came in (like steps), measures how long each took (like heart rate), and draws a map of where each request went (like GPS tracking). All this data is sent to dashboards where you can see if your app is healthy or struggling.

Technical Deep Dive

OTel Architecture in iris-service

flowchart LR
A["FastAPI App"] -->|Auto-instrumented| B["OTel SDK"]
B --> C["TracerProvider"]
B --> D["MeterProvider"]
C -->|Spans| E["BatchSpanProcessor"]
D -->|Metrics| F["PeriodicExportingMetricReader"]
E -->|OTLP gRPC| G["OTel Collector"]
F -->|OTLP gRPC| G
G -->|Traces| H["Jaeger"]
G -->|Metrics| I["Prometheus"]

Configuration (Environment Variables)

Variable | Default | Description
OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP gRPC endpoint for the collector
OTEL_SERVICE_NAME | iris-service | Service name in traces and metrics
OTEL_SAMPLE_RATE | 0.1 | Trace sampling ratio (10% = 1 in 10 requests)
OTEL_ENABLED | true | Master switch to disable all telemetry
SERVICE_VERSION | 1.0.0 | Version tag on all telemetry
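
The variables in the table above could be read into a typed settings object along the following lines. This is a minimal sketch, not the iris-service implementation: the `TelemetrySettings` dataclass and the `load_settings` helper are hypothetical names, and the rule for parsing `OTEL_ENABLED` is an assumption. Only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TelemetrySettings:
    """Hypothetical container for the documented environment variables."""
    endpoint: str
    service_name: str
    sample_rate: float
    enabled: bool
    service_version: str


def load_settings(env: Mapping[str, str] = os.environ) -> TelemetrySettings:
    """Read telemetry settings, falling back to the documented defaults."""
    rate = float(env.get("OTEL_SAMPLE_RATE", "0.1"))
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"OTEL_SAMPLE_RATE must be in [0, 1], got {rate}")
    return TelemetrySettings(
        endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
        service_name=env.get("OTEL_SERVICE_NAME", "iris-service"),
        sample_rate=rate,
        # Assumption: any value other than "true" (case-insensitive) disables telemetry.
        enabled=env.get("OTEL_ENABLED", "true").strip().lower() == "true",
        service_version=env.get("SERVICE_VERSION", "1.0.0"),
    )


settings = load_settings({"OTEL_SAMPLE_RATE": "1.0"})  # e.g. a debugging override
print(settings)
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.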

Resource Attributes

All telemetry is tagged with:

  • service.name = iris-service
  • service.version = 1.0.0
  • service.namespace = iris

Trace Instrumentation

sequenceDiagram
participant Client
participant FastAPI as FastAPI (auto-instrumented)
participant Router as API Router
participant Registry as SpriteRegistry
participant Engine as FingerprintEngine
Client->>FastAPI: POST /v1/sprites
FastAPI->>FastAPI: Create span: http.request
FastAPI->>Router: Route to create_sprite
Router->>Router: Create span: sprite_operation
Router->>Registry: create()
Registry->>Engine: compute_fingerprint()
Engine-->>Registry: hash
Registry-->>Router: Sprite
Router-->>FastAPI: 201
FastAPI-->>Client: Response
Note over FastAPI: Span exported to Jaeger<br/>via BatchSpanProcessor

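FastAPI is auto-instrumented via FastAPIInstrumentor.instrument_app(app). Each HTTP request is wrapped in a server span that carries the method, path, and status code as attributes; the duration follows from the span's start and end timestamps. Only sampled traces are exported.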

Custom Metrics

classDiagram
class MetricsRecorder {
+record_sprite_operation(op, name, status, duration, error)
+record_council_operation(op, domain, status, duration)
+record_chain_execution(status, duration, steps_count)
+record_gate_decision(gate_name, decision, duration)
}
class OperationTimer {
+duration_ms
+error
+__enter__()
+__exit__()
}
MetricsRecorder --> OperationTimer : uses

Counter metrics:

  • sprite_operations_total — by operation, status
  • council_operations_total — by operation, status
  • chain_executions_total — by status
  • chain_steps_executed_total — by status
  • gate_decisions_total — by gate_name, decision

Histogram metrics:

  • sprite_operation_duration_seconds — by operation, status
  • council_operation_duration_seconds — by operation, status
  • chain_execution_duration_seconds — by status
  • gate_evaluation_duration_seconds — by gate_name, decision
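
The timer and recorder from the class diagram might fit together roughly as follows. This is a sketch under stated assumptions: `InMemoryRecorder` is a hypothetical stand-in that stores values in plain dictionaries, whereas the real MetricsRecorder presumably writes to OpenTelemetry counters and histograms. The `OperationTimer` attributes and the method signature follow the diagram; the bodies are illustrative.

```python
import time
from collections import defaultdict


class OperationTimer:
    """Context manager exposing duration_ms and error, as in the class diagram."""

    def __init__(self):
        self.duration_ms = 0.0
        self.error = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.duration_ms = (time.perf_counter() - self._start) * 1000.0
        self.error = exc_type.__name__ if exc_type else None
        return False  # never swallow the exception


class InMemoryRecorder:
    """Hypothetical stand-in for MetricsRecorder that stores values in dicts."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.durations = defaultdict(list)

    def record_sprite_operation(self, op, name, status, duration, error=None):
        # The metric lists above label sprite metrics by operation and status;
        # name and error are accepted but not used as labels in this sketch.
        labels = (op, status)
        self.counters[("sprite_operations_total", labels)] += 1
        self.durations[("sprite_operation_duration_seconds", labels)].append(duration)


recorder = InMemoryRecorder()
timer = OperationTimer()
try:
    with timer:
        sum(range(1000))  # placeholder for the real sprite operation
finally:
    status = "error" if timer.error else "success"
    # The histogram is named *_seconds, so convert from milliseconds.
    recorder.record_sprite_operation(
        "create", "SOL-FORGE", status, timer.duration_ms / 1000.0, timer.error
    )
```

Recording in a `finally` block means failed operations are counted too, with status set to error.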

Trace Attributes

Attribute | Context | Example
sprite.name | Sprite operations | SOL-FORGE
sprite.version | Sprite operations | 1.0.0
operation | All operations | create, update, execute
status | All operations | success, error
error | Error spans | ValidationError
duration_ms | Timing spans | 45
gate.name | Gate evaluation | scope_check
gate.decision | Gate evaluation | approved, denied
chain.id | Chain execution | uuid
step.name | Step execution | generate_code

Sampling Strategy

flowchart LR
A["Incoming Request"] --> B["TraceIdRatioBased<br/>Sampler"]
B -->|Sample rate: 0.1| C["10% sampled"]
B -->|90% rejected| D["Not traced"]
C --> E["BatchSpanProcessor"]
E --> F["OTLP Exporter"]

  • Probabilistic sampling: TraceIdRatioBased at the configured sample rate; the decision is derived from the trace ID, so every span in a trace gets the same verdict
  • 10% default: about 1 in 10 requests generates a full trace
  • Batched export: Spans are buffered and sent in batches to reduce overhead
  • 5-second flush: On shutdown, all buffered spans are force-flushed with a 5-second timeout
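
The idea behind ratio-based sampling on the trace ID can be sketched in a few lines. This is a simplified illustration of the technique, not the SDK's code: the `should_sample` function is a hypothetical name, and the random trace IDs are generated only for the demonstration.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # compare only the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, rate: float) -> bool:
    """Deterministic ratio decision: the same trace ID always gets the same answer."""
    bound = round(rate * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
sampled = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"sampled {sampled / len(trace_ids):.1%} of traces")
```

Because the decision depends only on the trace ID and the rate, services that share the same rate reach the same verdict for a given trace without coordinating.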

Lifecycle Management

# Startup (lifespan)
initialize_telemetry() # Creates TracerProvider + MeterProvider
instrument_app(app) # Auto-instruments FastAPI
# Shutdown (lifespan)
shutdown_telemetry() # Force-flushes with 5s timeout

Both startup and shutdown are idempotent, so calling either more than once is safe. Telemetry can be completely disabled via OTEL_ENABLED=false.

Key Terms

  • Trace → A directed acyclic graph of spans representing a single request’s path through the system
  • Span → A single operation within a trace (e.g., “HTTP POST /v1/sprites”, “compute fingerprint”)
  • Metric → A numeric measurement aggregated over time (counters, histograms, gauges)
  • OTLP → OpenTelemetry Protocol; the wire protocol for exporting telemetry, used here over gRPC (OTLP can also run over HTTP)
  • Sampler → Decides which traces to collect; IRIS uses probabilistic ratio-based sampling
  • BatchSpanProcessor → Buffers spans and exports them in batches for efficiency
  • Resource → Metadata attached to all telemetry from a service (name, version, namespace)

Q&A

Q: How do I increase trace sampling for debugging? A: Set OTEL_SAMPLE_RATE=1.0 to capture 100% of requests. Remember to reduce this in production to avoid overhead.

Q: Can I add custom spans to my code? A: Yes. Use get_tracer(name).start_as_current_span("my_operation") as a context manager. The tracer is configured globally after initialize_telemetry().

Q: What happens if the OTel Collector is down? A: The OTLP exporter retries with exponential backoff. If the collector stays unavailable, spans are dropped rather than buffered indefinitely, which prevents unbounded memory growth.
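
The retry-then-drop behaviour can be illustrated with a small sketch. This is not the OTLP exporter's code: `export_with_backoff`, its parameters, and the attempt count and delays are hypothetical values chosen for the example.

```python
import time
from typing import Callable, Sequence


def export_with_backoff(
    send: Callable[[Sequence[str]], None],
    batch: Sequence[str],
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send a batch; return False (batch dropped) after max_attempts failures."""
    for attempt in range(max_attempts):
        try:
            send(batch)
            return True
        except ConnectionError:
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)  # delay doubles after each failure
    return False  # give up and drop the batch instead of holding it in memory


def collector_down(batch):
    raise ConnectionError("collector unavailable")


delays = []
delivered = export_with_backoff(collector_down, ["span-1"], sleep=delays.append)
print(delivered, delays)
```

Injecting `sleep` lets the example record the delays instead of actually waiting.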

Q: How do I correlate logs with traces? A: The request context middleware injects request_id (from X-Request-ID or X-Correlation-ID headers) into request.state. Include this ID in log entries for correlation.
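
Including the ID in log entries could be done with a logging filter, as in this sketch. It assumes the middleware also stores the request ID in a ContextVar so that code without access to request.state can read it; `request_id_var`, `RequestIdFilter`, and the `iris.example` logger name are hypothetical.

```python
import contextvars
import io
import logging

# Assumption: the middleware sets this per request, alongside request.state.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


stream = io.StringIO()  # stands in for stdout or a log file in this demo
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
)
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("iris.example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

token = request_id_var.set("req-123")  # the middleware would do this per request
logger.info("sprite created")
request_id_var.reset(token)

print(stream.getvalue(), end="")
```

Searching the logs for a given request_id then returns every entry written while that request was being handled.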

Q: Are metrics persisted between restarts? A: No. Metric values are held in the memory of the running process and start again from zero after a restart. Historical values are stored in Prometheus’s time-series database (15-day retention in the Docker stack).

Examples

OpenTelemetry is like a city’s traffic management system:

  • Traces = Individual car GPS tracks showing exactly which streets they took, how long at each intersection, and where they stopped
  • Metrics = Traffic counters on each road (“45 cars per minute on Main Street”) and average travel times
  • Sampling = Only tracking 10% of cars to save money on GPS units, but still getting accurate traffic patterns
  • Jaeger = The traffic control centre where operators watch live GPS tracks
  • Prometheus = The city’s database of historical traffic volumes
  • Grafana = The big screens in city hall showing “Current congestion: moderate”
