CRUMB a card from devarno-cloud

Docker Compose & Observability Stack

iris intermediate 6 min read

ELI5

The IRIS Docker stack is like a fully equipped hospital. The main building (iris-service) does the actual work. It connects to a filing cabinet (PostgreSQL) for long-term records, a security camera system (Jaeger) that tracks every person’s movements, a dashboard (Grafana) showing vital signs in real time, a logbook (Loki) recording every conversation, and a central nurse station (OpenTelemetry Collector) that routes all information to the right places.

Technical Deep Dive

7-Service Docker Compose Stack

flowchart TB
    subgraph network["iris-network"]
        iris_app["iris-service FastAPI<br/>Port 8000"]
        db["PostgreSQL 16<br/>5433 → 5432"]
        otel["OTel Collector<br/>gRPC 4317 / HTTP 4318 / Metrics 8888"]
        jaeger["Jaeger All-in-One<br/>16687 → 16686"]
        prometheus["Prometheus<br/>9091 → 9090"]
        grafana["Grafana<br/>3001 → 3000"]
        loki["Loki<br/>3101 → 3100"]
    end
    iris_app --> db
    iris_app --> otel
    otel --> jaeger
    otel --> prometheus
    grafana --> prometheus
    grafana --> jaeger
    grafana --> loki

Service Details

| Service | Image | Ports (host → container) | Purpose | Depends On |
|---|---|---|---|---|
| db | postgres:16-alpine | 5433 → 5432 | Primary data store | |
| otel-collector | otel/opentelemetry-collector:0.103.0 | 4317, 4318, 8888 | Telemetry routing pipeline | jaeger |
| jaeger | jaegertracing/all-in-one:latest | 16687 → 16686 | Distributed tracing UI | |
| prometheus | prom/prometheus:latest | 9091 → 9090 | Metrics storage (15-day retention) | |
| grafana | grafana/grafana:latest | 3001 → 3000 | Dashboards and visualisation | prometheus, jaeger, loki |
| loki | grafana/loki:latest | 3101 → 3100 | Structured log aggregation | |
| iris-service | Built from ./Dockerfile | 8000 | Core FastAPI application | db, otel-collector, jaeger, prometheus, grafana (health check) |
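
Each "host → container" pair in the table corresponds to a Compose ports entry written as host:container. The fragment below is a minimal sketch of that mapping for three of the services. It assumes the short port syntax, and the 8000:8000 mapping for iris-service is an assumption, since the source only states that the service uses port 8000.

```yaml
# Sketch only: shows how the port columns translate to Compose syntax.
# The real compose file may use different options or ordering.
services:
  db:
    image: postgres:16-alpine
    ports:
      - "5433:5432"   # host 5433 -> container 5432
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"   # host 3001 -> container 3000
  iris-service:
    build: .          # built from ./Dockerfile
    ports:
      - "8000:8000"   # assumed: same port on host and container
```

Only the host side of each pair matters when connecting from your own machine. Containers on the shared network talk to each other on the container-side port.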

Data Flow

flowchart LR
    A["iris-service"] -->|Traces + Metrics<br/>OTLP gRPC| B["OTel Collector"]
    A -->|Logs<br/>File/Console| C["Loki"]
    B -->|Traces| D["Jaeger"]
    B -->|Metrics| E["Prometheus"]
    E --> F["Grafana"]
    D --> F
    C --> F
    A -->|Data| G["PostgreSQL"]
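
The routing step in this flow is normally expressed in the Collector's own configuration file. The following is a sketch of what such a file could look like, not the project's actual config. The file name, the batch processor, the Jaeger OTLP endpoint (jaeger:4317), and the Prometheus exporter port (8889) are all assumptions. The source lists only 4317, 4318, and 8888 for the Collector, and 8888 is the port the Collector uses by default for its own internal metrics.

```yaml
# otel-collector-config.yaml (hypothetical file name; illustrative sketch)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # OTLP gRPC, as used by iris-service
      http:
        endpoint: 0.0.0.0:4318   # OTLP HTTP

processors:
  batch: {}                      # assumed: batch telemetry before export

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317        # assumed: Jaeger all-in-one accepting OTLP
    tls:
      insecure: true             # plain gRPC inside the Docker network
  prometheus:
    endpoint: 0.0.0.0:8889       # assumed port that Prometheus would scrape

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

With a Prometheus exporter like this, the Collector does not push metrics. It exposes them on an endpoint and Prometheus pulls them on its scrape interval.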

Dependency Chain

flowchart TD
    A["iris-service"] -->|condition: service_healthy| B["db"]
    A -->|condition: service_healthy| C["otel-collector"]
    A -->|condition: service_healthy| D["jaeger"]
    A -->|condition: service_healthy| E["prometheus"]
    A -->|condition: service_healthy| F["grafana"]
    C -->|depends_on| D
    F -->|reads from| E
    F -->|reads from| D
    F -->|reads from| G["loki"]

The iris-service container waits for the five infrastructure services shown above (db, otel-collector, jaeger, prometheus, and grafana) to report healthy before it starts. This ensures the database, telemetry pipeline, and observability backends are ready to accept connections.
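
In Compose, this kind of gating is written with the long form of depends_on. The sketch below mirrors the five edges in the dependency diagram. It is an illustration of the syntax rather than a copy of the project's file.

```yaml
# Sketch: iris-service starts only after each listed service is healthy.
services:
  iris-service:
    build: .
    depends_on:
      db:
        condition: service_healthy
      otel-collector:
        condition: service_healthy
      jaeger:
        condition: service_healthy
      prometheus:
        condition: service_healthy
      grafana:
        condition: service_healthy
```

The service_healthy condition only works when the dependency defines a healthcheck, so each of the five services needs one of its own, as db does in the next section.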

PostgreSQL Configuration

db:
  image: postgres:16-alpine
  environment:
    POSTGRES_USER: iris
    POSTGRES_PASSWORD: iris
    POSTGRES_DB: iris_db
    LANG: en_US.utf8
  volumes:
    - postgres_data:/var/lib/postgresql/data
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U iris -d iris_db"]
    interval: 5s
    timeout: 5s
    retries: 5

Note: Although PostgreSQL is provisioned, iris-service currently uses in-memory registries. The SQLAlchemy ORM models and async database infrastructure are defined, but init_db() is still a placeholder.

Volumes

| Volume | Service | Purpose |
|---|---|---|
| postgres_data | db | Persistent database storage |
| prometheus_data | prometheus | Metrics time-series data |
| grafana_storage | grafana | Dashboards and user preferences |
| loki_data | loki | Log index and chunks |
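
Named volumes need both a top-level declaration and a mount in the service that uses them. A minimal sketch of how the four volumes could be wired follows. Only the postgres_data mount path comes from the source; the other three container paths are assumptions based on the usual data directories of those images.

```yaml
# Sketch: named volumes and where each one might be mounted.
services:
  db:
    volumes:
      - postgres_data:/var/lib/postgresql/data   # from the source
  prometheus:
    volumes:
      - prometheus_data:/prometheus              # assumed mount path
  grafana:
    volumes:
      - grafana_storage:/var/lib/grafana         # assumed mount path
  loki:
    volumes:
      - loki_data:/loki                          # assumed mount path

volumes:
  postgres_data:
  prometheus_data:
  grafana_storage:
  loki_data:
```

Because these are named volumes, the data survives docker compose down and is removed only when the volumes are deleted explicitly, for example with docker compose down -v.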

Network

All 7 services share a single Docker bridge network: iris-network. This enables DNS-based service discovery (e.g., db:5432, otel-collector:4317).
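
As a sketch of what DNS-based discovery looks like in practice, the fragment below attaches two services to the network and points the application at its neighbours by service name. DATABASE_URL is a hypothetical variable name, and the asyncpg driver in the URL is an assumption. OTEL_EXPORTER_OTLP_ENDPOINT is the standard OpenTelemetry SDK variable, but whether iris-service reads it is not stated in the source.

```yaml
# Sketch: service names double as hostnames on the shared bridge network.
services:
  db:
    networks: [iris-network]
  iris-service:
    networks: [iris-network]
    environment:
      # Hypothetical variable; note the container port 5432, not host port 5433.
      DATABASE_URL: postgresql+asyncpg://iris:iris@db:5432/iris_db
      # Standard OTel SDK variable; assumed to be honoured by the app.
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317

networks:
  iris-network:
    driver: bridge
```

The remapped host ports (5433, 9091, 3001, and so on) apply only to traffic from the host machine. Inside the network, services always use the container-side ports.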

Key Terms

  • OTel Collector → Central telemetry pipeline receiving OTLP traces/metrics and routing to backends
  • Jaeger → Distributed tracing backend for visualising request flows across services
  • Prometheus → Time-series metrics database with 15-day retention
  • Grafana → Dashboard platform reading from Prometheus, Jaeger, and Loki
  • Loki → Log aggregation system optimised for structured logs from containers
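  • Service health check → Compose depends_on condition (service_healthy) that holds a service back until the services it depends on pass their health checks
  • OTLP → OpenTelemetry Protocol; the transport for traces, metrics, and logs, carried over gRPC (port 4317) or HTTP (port 4318)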
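
The 15-day Prometheus retention mentioned above is typically set with a command-line flag. The sketch below shows one way it could appear in the compose file, assuming the retention is set explicitly. 15d is also Prometheus's default when the flag is omitted, so the actual file may not set it at all.

```yaml
# Sketch: explicit retention for the Prometheus service.
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      # Overriding command replaces the image's default flags,
      # so the config file and storage path are restated here.
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"
    ports:
      - "9091:9090"
```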

Q&A

Q: Why does iris-service expose port 8000 but db uses 5433? A: Port 5433 avoids conflicts with local PostgreSQL installations (which typically use 5432). Similarly, 9091, 3001, 3101, and 16687 avoid conflicts with locally running Prometheus, Grafana, Loki, and Jaeger instances.

Q: Is the database actually used? A: Not yet. The service uses in-memory registries. SQLAlchemy ORM models are defined and the async session factory is wired, but init_db() and close_db() are no-ops. Future versions will persist to PostgreSQL.

Q: How do I view traces? A: Open http://localhost:16687 for the Jaeger UI. Search by service name (iris-service), operation, or trace ID.

Q: How do I view metrics dashboards? A: Open http://localhost:3001 for Grafana. Default credentials are admin/admin. Prometheus is pre-configured as a datasource.
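
Pre-configured datasources in Grafana are usually supplied through a provisioning file mounted into the container. The following is a sketch of such a file under stated assumptions: the file path is hypothetical, and the Jaeger and Loki entries are inferred from the architecture diagram, since the source confirms only that Prometheus is pre-configured.

```yaml
# e.g. grafana/provisioning/datasources/datasources.yaml (hypothetical path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # container port, resolved via Docker DNS
    isDefault: true
  - name: Jaeger                  # assumed entry
    type: jaeger
    access: proxy
    url: http://jaeger:16686
  - name: Loki                    # assumed entry
    type: loki
    access: proxy
    url: http://loki:3100
```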

Q: What is the OTel sampling rate? A: Default 10% (OTEL_SAMPLE_RATE=0.1). In production, you might increase this for debugging or decrease it to reduce overhead.
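
A sketch of how the sampling rate could be overridden through the compose file follows. OTEL_SAMPLE_RATE comes from the source. How iris-service parses the value, and whether it also honours the standard SDK variables shown in the comments, are assumptions.

```yaml
# Sketch: adjust trace sampling for iris-service.
services:
  iris-service:
    environment:
      OTEL_SAMPLE_RATE: "0.1"   # default per the source: sample 10% of traces
      # For a debugging session, "1.0" would sample every trace.
      # Standard OpenTelemetry SDK equivalents, if the app relies on them:
      # OTEL_TRACES_SAMPLER: parentbased_traceidratio
      # OTEL_TRACES_SAMPLER_ARG: "0.1"
```

Quoting the value keeps it a string, which is what environment variables are once they reach the container.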

Examples

The Docker stack is like a newsroom:

  • iris-service = The reporters writing stories (the actual work)
  • PostgreSQL = The filing cabinet storing published articles (permanent records)
  • OTel Collector = The editorial desk routing stories to the right departments
  • Jaeger = The security footage showing exactly who spoke to whom and when (request traces)
  • Prometheus = The analytics team tracking page views, errors, and response times (metrics)
  • Grafana = The wall of monitors in the newsroom showing live stats
  • Loki = The transcription service recording every phone call and meeting (logs)
