DLQ Retry & Replay Pipeline

skyflow intermediate 5 min read

ELI5

Every consumer has up to 3 attempts to ACK an event before NATS gives up. Event Router catches that final failure, copies the original bytes onto dlq.<original_subject> and into a dlq_events Postgres row, then triggers a Prometheus alert at >100 entries. A CLI tool re-publishes events back onto their original subject when the underlying bug is fixed.

Technical Deep Dive

Settings That Drive It

Setting	Default	Source
`max_deliver`	3	per-consumer config
`ack_wait`	60s	per-consumer config
`MAX_RETRIES` env	3	event-router service env
DLQ stream `max_age`	30d	`nats/topology.md` § 6

Flow

flowchart TD
    P[Producer publishes event] --> S[(Stream)]
    S --> C[Consumer.Fetch]
    C --> H{handler ok?}
    H -->|ack| OK[done]
    H -->|nack/timeout| R{deliveries < max_deliver?}
    R -->|yes| C
    R -->|no| ER[event-router observes terminal failure]
    ER --> DP[publish dlq.<subject>]
    ER --> DB[INSERT dlq_events]
    ER --> AL[INSERT audit_log]
    DP --> DLQ[(DLQ stream, 30d)]
    DB --> A1[Prometheus dlq_events_count]
    A1 --> AL2{>100?}
    AL2 -->|yes| ALERT[HighDLQSize warning]

`dlq_events` table

CREATE TABLE dlq_events (
    event_id TEXT PRIMARY KEY,
    subject TEXT NOT NULL,
    data BYTEA NOT NULL,         -- original protobuf
    error TEXT NOT NULL,
    retry_count INT DEFAULT 0,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    last_retry_at TIMESTAMPTZ
);

Replay Sequence

sequenceDiagram
    autonumber
    participant Op as Operator
    participant CLI as event-router CLI
    participant PG as Postgres
    participant N as NATS

    Op->>CLI: event-router replay --event-id evt_abc
    CLI->>PG: SELECT subject, data FROM dlq_events WHERE event_id=?
    CLI->>N: js.Publish(subject, data)
    CLI->>PG: UPDATE dlq_events SET retry_count=retry_count+1, last_retry_at=NOW()
    CLI-->>Op: replayed

Other modes:

event-router replay --subject events.timeline.completed
event-router replay --since 1h

Metric & Alert Surfaces

events_processed_total{subject,status}        # status: success|failed|dlq
events_processing_duration_seconds{subject}
dlq_events_count                              # gauge

- alert: HighDLQSize
  expr: dlq_events_count > 100
  for: 5m
- alert: HighEventFailureRate
  expr: rate(events_processed_total{status="failed"}[5m]) > 0.01
  for: 5m

Key Terms

max_deliver → JetStream consumer setting; deliveries beyond this are dropped (event-router observes the terminal NACK)
ack_wait → time NATS waits for ACK before redelivering; default 60s
DLQ stream → bound subject dlq.>, file storage, 30-day retention
audit_log → every event (success or failure) is also logged here for compliance
Replay → re-publishes bytes onto the original subject; consumers see it as a normal event

Q&A

Q: After how many failures does an event hit the DLQ? A: 3 (max_deliver=3 default). The 4th delivery would have happened — instead Event Router publishes to dlq.<subject> and the original event is gone from its source stream.

Q: Where is the original payload kept after DLQ? A: Two places: the DLQ JetStream stream (30-day retention) and the dlq_events Postgres table (no auto-expiry). Replay reads from Postgres.

Q: Won’t a replay create duplicate side effects? A: Consumers should be idempotent — most use BaseEvent.event_id to dedupe (e.g. Gamification checks xp_transactions.event_id). If a consumer is not idempotent, replaying a DLQ event will double-process; fix the consumer first.

Q: Why is Event Router’s role separate from each service’s own consumer? A: Centralised audit logging, consistent DLQ semantics, and one place to wire Prometheus/alerts. Per-service ad-hoc retry would make alert thresholds inconsistent.

Examples

Like a postal service’s “undeliverable mail” room. After three delivery attempts the parcel goes to the dead-letter office, photographed (audit_log) and shelved (dlq_events). When the recipient’s address is fixed, an operator pulls the parcel and drops it back into the normal sorting belt.

neighbors on the map

EventEnvelope Wire Wrapper publishing a new domain event proto
Run Outcome Classification interpreting a History row's status pill
FNP Observability & Prometheus Metrics monitoring FNP systems