FNP Kubernetes Multi-Region Architecture
ELI5
FNP runs in multiple data centers across the world (AWS, GCP, Azure) so if one catches fire, the others keep working. New code is slowly rolled out to 5% of users first (canary), then 50%, then everyone — if something breaks, it only affects 5%, not everyone.
Technical Deep Dive
Multi-Region Deployment
Active-passive across 3 regions:
| Region | Role | RTO | RPO |
|---|---|---|---|
| US-East | Primary (active) | N/A | N/A |
| EU-Central | Secondary (warm standby) | ~30 seconds | ~5 seconds |
| APAC | Tertiary (cold standby) | ~120 seconds | ~15 seconds |
Network topology:
flowchart TB
    GLB["Global Load Balancer\n(geo-routing)"]
    subgraph USE["US-East · Primary (active)"]
        USE_K["3–10 pods · auto-scaled"]
        USE_DB[("PostgreSQL 16\nwrite-ahead log")]
    end
    subgraph EUC["EU-Central · Warm Standby"]
        EUC_K["1–3 pods · auto-scaled"]
        EUC_DB[("PostgreSQL 16\nread-only replica")]
    end
    subgraph APAC["APAC · Cold Standby"]
        APAC_K["0–1 pods"]
        APAC_DB[("PostgreSQL 16\nread-only replica")]
    end
    GLB -->|"primary traffic"| USE_K
    GLB -->|"failover / reads"| EUC_K
    GLB -->|"tertiary failover"| APAC_K
    USE_DB -->|"async WAL stream\n~5s lag"| EUC_DB
    USE_DB -->|"async WAL stream\n~15s lag"| APAC_DB

The load balancer is positioned at the top to show it as the single entry point before traffic fans out to the three regions. Pods and databases are co-located inside region subgraphs to make it clear each region is a self-contained unit that could serve traffic independently. The async WAL arrows make the RPO window explicit: the replication lag labels are the numbers that determine data-loss exposure in a failover event.
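The region priority implied by the table and topology can be sketched in a few lines of Python. This is only an illustration of the active-passive ordering, not how the global load balancer is configured; the `Region` class, the `priority` field, and `pick_serving_region` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Region:
    name: str
    priority: int     # lower value = preferred (0 = primary); an assumed encoding
    rto_seconds: int  # approximate RTO from the table; 0 stands in for "N/A"


# Values taken from the region table above.
REGIONS = [
    Region("US-East", 0, 0),
    Region("EU-Central", 1, 30),
    Region("APAC", 2, 120),
]


def pick_serving_region(healthy: Set[str],
                        regions: Iterable[Region] = REGIONS) -> Optional[Region]:
    """Return the highest-priority region that is currently healthy."""
    for region in sorted(regions, key=lambda r: r.priority):
        if region.name in healthy:
            return region
    return None  # no region can serve traffic
```

For example, `pick_serving_region({"EU-Central", "APAC"})` returns the EU-Central entry, matching the warm-standby failover path in the diagram.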
Async PostgreSQL Replication
Primary (US-East):
PostgreSQL 16 (write-ahead log)
↓ (async replication)
Primary → Secondary WAL streaming

Replication lag:
- RPO (Recovery Point Objective): ~5 seconds
- Mechanism: Asynchronous streaming replication (doesn’t wait for ACK)
- Trade-off: Low latency for writes, small data loss window on failure
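A lag check against the RPO target might look like the sketch below. It assumes two timestamps are already available: the primary's latest commit time and the latest commit the replica has replayed. In PostgreSQL these could plausibly be derived from replication statistics such as the `replay_lag` column of `pg_stat_replication`, but how FNP actually collects them is not stated here, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RPO_BUDGET = timedelta(seconds=5)  # ~5 s target for EU-Central, from the text


def replication_lag(primary_commit_time: datetime,
                    replica_replay_time: datetime) -> timedelta:
    """How far the replica's last replayed commit trails the primary's last commit."""
    # Clamp at zero so small clock skew never reports a negative lag.
    return max(primary_commit_time - replica_replay_time, timedelta(0))


def within_rpo(primary_commit_time: datetime,
               replica_replay_time: datetime,
               budget: timedelta = RPO_BUDGET) -> bool:
    """True while the measured lag stays inside the RPO budget."""
    return replication_lag(primary_commit_time, replica_replay_time) <= budget


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(within_rpo(t0 + timedelta(seconds=3), t0))  # 3 s of lag, inside the budget
print(within_rpo(t0 + timedelta(seconds=8), t0))  # 8 s of lag, outside the budget
```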
Secondary (EU-Central):
PostgreSQL 16 (read-only replica)
Applies WAL from primary continuously
Serves read-only queries (for analytics)
Canary Deployment (Flagger)
Gradual rollout:
flowchart TD
    S0["Version N-1 → 100% traffic"]
    S0 -->|"create canary"| S1["Version N → 5% traffic"]
    S1 -->|"observe 10 min"| D1{{"metrics healthy?"}}
    D1 -->|"no → rollback"| S0
    D1 -->|"yes → proceed"| S2["Version N → 50% traffic"]
    S2 -->|"observe 5 min"| D2{{"metrics healthy?"}}
    D2 -->|"no → rollback"| S0
    D2 -->|"yes → proceed"| S3["Version N → 100% traffic ✓"]

The rollback arrows return to Version N-1 → 100% rather than to a dedicated rollback node because canary promotion is stateless: any failed gate instantly snaps traffic back to the previous stable version. The two distinct observation windows (10 min at 5%, 5 min at 50%) reflect asymmetric risk: the longer window at low traffic gives time for subtle failure modes to surface before wider blast radius.
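The gate logic in the flowchart can be modelled as a small loop. This is a plain-Python sketch of the control flow only, not Flagger's API or configuration; `CANARY_STEPS`, `run_canary`, and the `metrics_healthy` callback are hypothetical names.

```python
from typing import Callable, List, Tuple

# (traffic percentage for Version N, observation window in minutes), from the flowchart
CANARY_STEPS: List[Tuple[int, int]] = [(5, 10), (50, 5)]


def run_canary(metrics_healthy: Callable[[int], bool]) -> Tuple[str, List[int]]:
    """Walk the canary through each gate; return the outcome and the weights tried.

    metrics_healthy(weight) is assumed to report whether the metrics stayed
    healthy for the whole observation window at that traffic weight.
    """
    tried: List[int] = []
    for weight, _window_minutes in CANARY_STEPS:
        tried.append(weight)
        if not metrics_healthy(weight):
            # Any failed gate sends 100% of traffic back to Version N-1.
            return "rolled_back", tried
    tried.append(100)  # both gates passed: Version N takes all traffic
    return "promoted", tried
```

A release that degrades only under heavier load, modelled as `run_canary(lambda w: w < 50)`, passes the 5% gate and is rolled back at 50%.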
Metrics monitored:
- P99 latency (< 200ms target)
- Error rate (< 0.1% target)
- HTTP 5xx (< 5 per minute)
Automatic rollback: If any metric exceeds threshold, canary is halted and traffic reverts to previous version.
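The rollback rule amounts to a threshold comparison over the three metrics listed. A minimal sketch, assuming each observation arrives as a dictionary keyed by the hypothetical metric names below:

```python
from typing import Dict, List

# Limits from the list above; the key names are illustrative, not FNP's metric names.
THRESHOLDS: Dict[str, float] = {
    "p99_latency_ms": 200.0,
    "error_rate": 0.001,      # 0.1% expressed as a fraction
    "http_5xx_per_min": 5.0,
}


def breached_metrics(sample: Dict[str, float]) -> List[str]:
    """Names of the metrics whose observed value exceeds its threshold."""
    return [name for name, limit in THRESHOLDS.items() if sample[name] > limit]


def should_rollback(sample: Dict[str, float]) -> bool:
    """A single breached metric is enough to halt the canary."""
    return bool(breached_metrics(sample))
```

Returning the list of breached names rather than a bare boolean keeps the reason for a rollback visible to whoever investigates it.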
Horizontal Pod Autoscaling (HPA)
Scaling policy:
CPU Utilization → target 60%
- If CPU > 60%: scale up (+1 pod)
- If CPU < 30%: scale down (-1 pod)
- Min pods: 3, Max pods: 10 (US-East; the standby regions run the smaller ranges shown in the topology)
Calculation:
desired_replicas = ceil(current_replicas * (current_cpu / target_cpu))
Example: 5 pods at 85% CPU → 5 * (85/60) = 7.08 → rounded up to 8 pods (the HPA rounds up, never down)
RTO & RPO Targets
RTO (Recovery Time Objective): Time to recover service
flowchart TD
    F1["US-East failure detected"]
    F1 -->|"+5s · health check confirms"| F2["Failover to EU-Central initiated"]
    F2 -->|"+10s · DNS TTL propagation"| F3["Clients rerouted to EU-Central"]
    F3 -->|"+15s · pod startup & readiness"| F4["Service fully restored\nRTO ≈ 30 seconds"]

The timeline is modelled as a sequential chain rather than a parallel diagram because DNS propagation must complete before clients can route, and service startup cannot be declared ready until pods pass readiness probes. The per-step durations on each edge (5 s + 10 s + 15 s) make the 30-second RTO budget immediately auditable: each step’s contribution is visible at a glance.
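Reading the edge labels as per-step durations, the RTO budget is a running sum. The sketch below reproduces that arithmetic; the step names are paraphrased from the diagram and `rto_timeline` is a hypothetical helper.

```python
from itertools import accumulate
from typing import List, Tuple

# (step, duration in seconds), taken from the edge labels in the flowchart
FAILOVER_STEPS: List[Tuple[str, int]] = [
    ("health check confirms failure", 5),
    ("DNS TTL propagation", 10),
    ("pod startup and readiness", 15),
]


def rto_timeline(steps: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Pair each step with the seconds elapsed once that step has finished."""
    elapsed = accumulate(seconds for _, seconds in steps)
    return [(name, total) for (name, _), total in zip(steps, elapsed)]


timeline = rto_timeline(FAILOVER_STEPS)
total_rto = timeline[-1][1]  # elapsed time after the final step
print(timeline)
print(f"RTO ≈ {total_rto} seconds")
```

Because the steps are strictly sequential, shortening any one of them (a lower DNS TTL, for instance) reduces the total by the same amount.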
RPO (Recovery Point Objective): Data loss on failure
flowchart LR
    W["Confirmed writes\n(US-East)"] -->|"async WAL stream"| LAG["Replication lag\n≈ 5 seconds"]
    LAG -->|"applied"| R["EU-Central replica\n(up to date)"]
    LAG -->|"⚠ failure window"| LOST["Up to 5s of writes\nnot yet replicated"]
    style LOST fill:#f87171,color:#fff

Modelling the RPO as a branching path from the lag node makes it clear that the 5-second window is not a fixed loss; it is the maximum exposure. Under normal replication, writes land on the replica well within the lag budget; only if US-East fails precisely while WAL is in-flight does the worst-case loss materialise.
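The failure window can be made concrete with a toy model. It rests on a simplifying assumption that is not in the source: at the moment of failure, the replica has applied every write committed up to `failure_time - lag_seconds` and nothing newer. The timestamps are made-up seconds on an arbitrary clock.

```python
from typing import List


def unreplicated_writes(commit_times: List[float],
                        failure_time: float,
                        lag_seconds: float) -> List[float]:
    """Commits acknowledged on the primary but not yet applied on the replica."""
    replica_horizon = failure_time - lag_seconds  # newest commit the replica holds
    return [t for t in commit_times if replica_horizon < t <= failure_time]


commits = [0.0, 2.0, 6.5, 8.0, 9.9]  # illustrative commit timestamps
print(unreplicated_writes(commits, failure_time=10.0, lag_seconds=5.0))
print(unreplicated_writes(commits, failure_time=10.0, lag_seconds=0.5))
```

With the full 5-second lag three of the five writes are lost, while at half a second of lag only the most recent one is, which is why the RPO is a ceiling rather than a typical loss.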
Improving RPO:
- Synchronous replication: RPO = 0 (no data loss), but higher latency
- Trade-off: FNP accepts a ~5-second RPO in exchange for lower write latency
Key Terms
- RTO (Recovery Time Objective) → Maximum downtime tolerable
- RPO (Recovery Point Objective) → Maximum data loss tolerable
- Canary deployment → Gradual rollout starting at 5% traffic
- Async replication → Primary confirms a write without waiting for the replica's ACK; the replica applies it shortly afterwards
- HPA → Horizontal Pod Autoscaler; scales based on metrics
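To make the HPA term concrete, the replica calculation from the Horizontal Pod Autoscaling section can be sketched as below. The Kubernetes HPA documentation states the formula with a ceiling, so a raw value of 7.08 rounds up to 8 rather than down to 7. The clamp uses the 3–10 pod range given for the primary region; the function name is hypothetical, and the sketch leaves out the real controller's tolerance band and stabilization behaviour.

```python
import math


def desired_replicas(current_replicas: int,
                     current_cpu: float,
                     target_cpu: float = 60.0,
                     min_pods: int = 3,
                     max_pods: int = 10) -> int:
    """ceil(current_replicas * current_cpu / target_cpu), clamped to [min_pods, max_pods]."""
    raw = current_replicas * (current_cpu / target_cpu)
    return max(min_pods, min(max_pods, math.ceil(raw)))


print(desired_replicas(5, 85))   # raw 7.08, rounded up
print(desired_replicas(8, 120))  # raw 16, capped at max_pods
print(desired_replicas(5, 30))   # raw 2.5, rounded up and held at min_pods
```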
Q&A
Q: Why not active-active across all regions? A: Write conflicts across regions are harder to resolve. Active-passive (single writer) is simpler and guarantees consistency. Async replication gives good RTO/RPO trade-offs.
Q: What if the canary has a subtle bug that doesn’t affect P99 latency? A: The canary is observed for 10 minutes at 5% before moving to 50%, and a longer window gives subtle bugs more time to surface. FNP also watches 15+ metrics (not just latency), including error logs, to detect anomalies.
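Q: How do clients find the right region? A: A global load balancer (Google Cloud Global Load Balancing or AWS CloudFront) routes requests based on geography and region health. Because the deployment is active-passive, primary traffic goes to US-East in normal operation, EU-Central takes reads and the first failover, and APAC is the tertiary failover target.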
Examples
Multi-region deployment is like a restaurant chain: the primary region is the main restaurant, the secondary is a branch in the next city (can open quickly if needed), and the tertiary is a food truck (slower to start up, but available). Canary deployment is like a pilot program at one store before opening 100 franchises.