CRUMB a card from devarno-cloud

DLQ Retry & Replay Pipeline

skyflow intermediate 5 min read

ELI5

Every consumer has up to 3 attempts to ACK an event before NATS gives up. Event Router catches that final failure, copies the original bytes onto dlq.<original_subject> and into a dlq_events Postgres row, then triggers a Prometheus alert at >100 entries. A CLI tool re-publishes events back onto their original subject when the underlying bug is fixed.

Technical Deep Dive

Settings That Drive It

SettingDefaultSource
max_deliver3per-consumer config
ack_wait60sper-consumer config
MAX_RETRIES env3event-router service env
DLQ stream max_age30dnats/topology.md § 6

Flow

flowchart TD
P[Producer publishes event] --> S[(Stream)]
S --> C[Consumer.Fetch]
C --> H{handler ok?}
H -->|ack| OK[done]
H -->|nack/timeout| R{deliveries < max_deliver?}
R -->|yes| C
R -->|no| ER[event-router observes terminal failure]
ER --> DP[publish dlq.<subject>]
ER --> DB[INSERT dlq_events]
ER --> AL[INSERT audit_log]
DP --> DLQ[(DLQ stream, 30d)]
DB --> A1[Prometheus dlq_events_count]
A1 --> AL2{>100?}
AL2 -->|yes| ALERT[HighDLQSize warning]

dlq_events table

CREATE TABLE dlq_events (
event_id TEXT PRIMARY KEY,
subject TEXT NOT NULL,
data BYTEA NOT NULL, -- original protobuf
error TEXT NOT NULL,
retry_count INT DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW(),
last_retry_at TIMESTAMPTZ
);

Replay Sequence

sequenceDiagram
autonumber
participant Op as Operator
participant CLI as event-router CLI
participant PG as Postgres
participant N as NATS
Op->>CLI: event-router replay --event-id evt_abc
CLI->>PG: SELECT subject, data FROM dlq_events WHERE event_id=?
CLI->>N: js.Publish(subject, data)
CLI->>PG: UPDATE dlq_events SET retry_count=retry_count+1, last_retry_at=NOW()
CLI-->>Op: replayed

Other modes:

Terminal window
event-router replay --subject events.timeline.completed
event-router replay --since 1h

Metric & Alert Surfaces

events_processed_total{subject,status} # status: success|failed|dlq
events_processing_duration_seconds{subject}
dlq_events_count # gauge
- alert: HighDLQSize
expr: dlq_events_count > 100
for: 5m
- alert: HighEventFailureRate
expr: rate(events_processed_total{status="failed"}[5m]) > 0.01
for: 5m

Key Terms

  • max_deliver → JetStream consumer setting; deliveries beyond this are dropped (event-router observes the terminal NACK)
  • ack_wait → time NATS waits for ACK before redelivering; default 60s
  • DLQ stream → bound subject dlq.>, file storage, 30-day retention
  • audit_log → every event (success or failure) is also logged here for compliance
  • Replay → re-publishes bytes onto the original subject; consumers see it as a normal event

Q&A

Q: After how many failures does an event hit the DLQ? A: 3 (max_deliver=3 default). The 4th delivery would have happened — instead Event Router publishes to dlq.<subject> and the original event is gone from its source stream.

Q: Where is the original payload kept after DLQ? A: Two places: the DLQ JetStream stream (30-day retention) and the dlq_events Postgres table (no auto-expiry). Replay reads from Postgres.

Q: Won’t a replay create duplicate side effects? A: Consumers should be idempotent — most use BaseEvent.event_id to dedupe (e.g. Gamification checks xp_transactions.event_id). If a consumer is not idempotent, replaying a DLQ event will double-process; fix the consumer first.

Q: Why is Event Router’s role separate from each service’s own consumer? A: Centralised audit logging, consistent DLQ semantics, and one place to wire Prometheus/alerts. Per-service ad-hoc retry would make alert thresholds inconsistent.

Examples

Like a postal service’s “undeliverable mail” room. After three delivery attempts the parcel goes to the dead-letter office, photographed (audit_log) and shelved (dlq_events). When the recipient’s address is fixed, an operator pulls the parcel and drops it back into the normal sorting belt.

neighbors on the map