DLQ Retry & Replay Pipeline
skyflow intermediate 5 min read
ELI5
Every consumer has up to 3 attempts to ACK an event before NATS gives up. Event Router catches that final failure, copies the original bytes onto dlq.<original_subject> and into a dlq_events Postgres row, then triggers a Prometheus alert at >100 entries. A CLI tool re-publishes events back onto their original subject when the underlying bug is fixed.
Technical Deep Dive
Settings That Drive It
| Setting | Default | Source |
|---|---|---|
max_deliver | 3 | per-consumer config |
ack_wait | 60s | per-consumer config |
MAX_RETRIES env | 3 | event-router service env |
DLQ stream max_age | 30d | nats/topology.md § 6 |
Flow
flowchart TD P[Producer publishes event] --> S[(Stream)] S --> C[Consumer.Fetch] C --> H{handler ok?} H -->|ack| OK[done] H -->|nack/timeout| R{deliveries < max_deliver?} R -->|yes| C R -->|no| ER[event-router observes terminal failure] ER --> DP[publish dlq.<subject>] ER --> DB[INSERT dlq_events] ER --> AL[INSERT audit_log] DP --> DLQ[(DLQ stream, 30d)] DB --> A1[Prometheus dlq_events_count] A1 --> AL2{>100?} AL2 -->|yes| ALERT[HighDLQSize warning]dlq_events table
CREATE TABLE dlq_events ( event_id TEXT PRIMARY KEY, subject TEXT NOT NULL, data BYTEA NOT NULL, -- original protobuf error TEXT NOT NULL, retry_count INT DEFAULT 0, created_at TIMESTAMPTZ DEFAULT NOW(), last_retry_at TIMESTAMPTZ);Replay Sequence
sequenceDiagram autonumber participant Op as Operator participant CLI as event-router CLI participant PG as Postgres participant N as NATS
Op->>CLI: event-router replay --event-id evt_abc CLI->>PG: SELECT subject, data FROM dlq_events WHERE event_id=? CLI->>N: js.Publish(subject, data) CLI->>PG: UPDATE dlq_events SET retry_count=retry_count+1, last_retry_at=NOW() CLI-->>Op: replayedOther modes:
event-router replay --subject events.timeline.completedevent-router replay --since 1hMetric & Alert Surfaces
events_processed_total{subject,status} # status: success|failed|dlqevents_processing_duration_seconds{subject}dlq_events_count # gauge- alert: HighDLQSize expr: dlq_events_count > 100 for: 5m- alert: HighEventFailureRate expr: rate(events_processed_total{status="failed"}[5m]) > 0.01 for: 5mKey Terms
max_deliver→ JetStream consumer setting; deliveries beyond this are dropped (event-router observes the terminal NACK)ack_wait→ time NATS waits for ACK before redelivering; default 60s- DLQ stream → bound subject
dlq.>, file storage, 30-day retention audit_log→ every event (success or failure) is also logged here for compliance- Replay → re-publishes bytes onto the original subject; consumers see it as a normal event
Q&A
Q: After how many failures does an event hit the DLQ?
A: 3 (max_deliver=3 default). The 4th delivery would have happened — instead Event Router publishes to dlq.<subject> and the original event is gone from its source stream.
Q: Where is the original payload kept after DLQ?
A: Two places: the DLQ JetStream stream (30-day retention) and the dlq_events Postgres table (no auto-expiry). Replay reads from Postgres.
Q: Won’t a replay create duplicate side effects?
A: Consumers should be idempotent — most use BaseEvent.event_id to dedupe (e.g. Gamification checks xp_transactions.event_id). If a consumer is not idempotent, replaying a DLQ event will double-process; fix the consumer first.
Q: Why is Event Router’s role separate from each service’s own consumer? A: Centralised audit logging, consistent DLQ semantics, and one place to wire Prometheus/alerts. Per-service ad-hoc retry would make alert thresholds inconsistent.
Examples
Like a postal service’s “undeliverable mail” room. After three delivery attempts the parcel goes to the dead-letter office, photographed (audit_log) and shelved (dlq_events). When the recipient’s address is fixed, an operator pulls the parcel and drops it back into the normal sorting belt.
neighbors on the map
- EventEnvelope Wire Wrapper publishing a new domain event proto
- Run Outcome Classification interpreting a History row's status pill
- FNP Observability & Prometheus Metrics monitoring FNP systems