FNP Kubernetes Multi-Region Architecture


ELI5

FNP runs in multiple data centers across the world (AWS, GCP, Azure), so if the main one catches fire, a standby takes over. New code is rolled out slowly: first to 5% of users (canary), then 50%, then everyone. If something breaks at the first step, it affects only 5% of users, not everyone.

Technical Deep Dive

Multi-Region Deployment

Active-passive across 3 regions:

Region	Role	RTO	RPO
US-East	Primary (active)	N/A	N/A
EU-Central	Secondary (warm standby)	~30 seconds	~5 seconds
APAC	Tertiary (cold standby)	~120 seconds	~15 seconds

Network topology:

flowchart TB
    GLB["Global Load Balancer\n(geo-routing)"]

    subgraph USE["US-East · Primary (active)"]
        USE_K["3–10 pods · auto-scaled"]
        USE_DB[("PostgreSQL 16\nwrite-ahead log")]
    end

    subgraph EUC["EU-Central · Warm Standby"]
        EUC_K["1–3 pods · auto-scaled"]
        EUC_DB[("PostgreSQL 16\nread-only replica")]
    end

    subgraph APAC["APAC · Cold Standby"]
        APAC_K["0–1 pods"]
        APAC_DB[("PostgreSQL 16\nread-only replica")]
    end

    GLB -->|"primary traffic"| USE_K
    GLB -->|"failover / reads"| EUC_K
    GLB -->|"tertiary failover"| APAC_K

    USE_DB -->|"async WAL stream\n~5s lag"| EUC_DB
    USE_DB -->|"async WAL stream\n~15s lag"| APAC_DB

The load balancer is positioned at the top to show it as the single entry point before traffic fans out to the three regions. Pods and databases are co-located inside region subgraphs to make it clear each region is a self-contained unit that could serve traffic independently. The async WAL arrows make the RPO window explicit — the replication lag labels are the numbers that determine data-loss exposure in a failover event.
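
The failover order implied by the topology can be sketched in a few lines of Python. This is only an illustration, assuming the load balancer walks the regions in the priority order of the table (US-East, EU-Central, APAC); the `healthy` map and the function name are hypothetical and do not come from FNP's configuration.

```python
from typing import Mapping, Optional, Sequence

# Failover priority from the region table: primary, warm standby, cold standby.
FAILOVER_ORDER: Sequence[str] = ("US-East", "EU-Central", "APAC")


def pick_serving_region(
    healthy: Mapping[str, bool],
    order: Sequence[str] = FAILOVER_ORDER,
) -> Optional[str]:
    """Return the highest-priority region whose health check passes.

    `healthy` is a hypothetical health-check result map; a region that is
    missing from the map is treated as unhealthy.
    """
    for region in order:
        if healthy.get(region, False):
            return region
    return None  # every region is down


if __name__ == "__main__":
    # Normal operation: the primary serves traffic.
    print(pick_serving_region({"US-East": True, "EU-Central": True, "APAC": True}))
    # Primary outage: traffic moves to the warm standby.
    print(pick_serving_region({"US-East": False, "EU-Central": True, "APAC": True}))
```

A real global load balancer makes this decision from its own health probes; the sketch only shows the ordering logic.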

Async PostgreSQL Replication

Primary (US-East):

PostgreSQL 16 (write-ahead log)
 ↓ (async replication)
Primary → Secondary WAL streaming

Replication lag:

RPO (Recovery Point Objective): ~5 seconds to EU-Central (~15 seconds to APAC)
Mechanism: Asynchronous streaming replication (the primary commits without waiting for the replica’s ACK)
Trade-off: Low write latency, at the cost of a small data-loss window on failure

Secondary (EU-Central):

PostgreSQL 16 (read-only replica)
Applies WAL from primary continuously
Serves read-only queries (for analytics)
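
One way to see how far a replica is behind is to compare WAL positions (LSNs). The sketch below assumes the two LSN strings have already been read elsewhere, for example with `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on the replica. How FNP actually measures lag is not stated here, so treat the helper names as illustrative.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.

    An LSN is a 64-bit WAL position printed as two hexadecimal numbers:
    the high 32 bits, a slash, then the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_replay_lsn: str) -> int:
    """WAL bytes the replica has not replayed yet (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_replay_lsn))


if __name__ == "__main__":
    # Example LSN values, made up for illustration.
    print(replication_lag_bytes("16/B374D848", "16/B374D000"))
```

Byte lag is not the same as the time-based lag quoted above (~5 seconds); it is a complementary signal that stays meaningful even when the primary is idle.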

Canary Deployment (Flagger)

Gradual rollout:

flowchart TD
    S0["Version N-1 → 100% traffic"]
    S0 -->|"create canary"| S1["Version N → 5% traffic"]
    S1 -->|"observe 10 min"| D1{{"metrics healthy?"}}
    D1 -->|"no → rollback"| S0
    D1 -->|"yes → proceed"| S2["Version N → 50% traffic"]
    S2 -->|"observe 5 min"| D2{{"metrics healthy?"}}
    D2 -->|"no → rollback"| S0
    D2 -->|"yes → proceed"| S3["Version N → 100% traffic ✓"]

The rollback arrows return to Version N-1 → 100% rather than to a dedicated rollback node because canary promotion is stateless — any failed gate instantly snaps traffic back to the previous stable version. The two distinct observation windows (10 min at 5%, 5 min at 50%) reflect asymmetric risk: the longer window at low traffic gives time for subtle failure modes to surface before wider blast radius.

Metrics monitored:

P99 latency (< 200ms target)
Error rate (< 0.1% target)
HTTP 5xx (< 5 per minute)

Automatic rollback: If any metric exceeds threshold, canary is halted and traffic reverts to previous version.
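
The gate logic can be sketched as plain Python. This is not Flagger's API or configuration format; it only mirrors the thresholds and traffic steps listed above, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryMetrics:
    p99_latency_ms: float
    error_rate: float        # fraction of requests, so 0.001 means 0.1%
    http_5xx_per_min: float


# Thresholds copied from the metrics list above.
MAX_P99_MS = 200.0
MAX_ERROR_RATE = 0.001
MAX_5XX_PER_MIN = 5.0

# Canary traffic weights (percent) from the rollout diagram.
STAGES = (5, 50, 100)


def is_healthy(m: CanaryMetrics) -> bool:
    """True only if every monitored metric is under its threshold."""
    return (
        m.p99_latency_ms < MAX_P99_MS
        and m.error_rate < MAX_ERROR_RATE
        and m.http_5xx_per_min < MAX_5XX_PER_MIN
    )


def next_weight(current_weight: int, m: CanaryMetrics) -> int:
    """Return the canary's next traffic weight; 0 means roll back.

    `current_weight` must be one of STAGES, otherwise ValueError is raised.
    """
    idx = STAGES.index(current_weight)
    if not is_healthy(m):
        return 0  # all traffic returns to the previous version
    return STAGES[min(idx + 1, len(STAGES) - 1)]


if __name__ == "__main__":
    good = CanaryMetrics(p99_latency_ms=150, error_rate=0.0005, http_5xx_per_min=1)
    slow = CanaryMetrics(p99_latency_ms=250, error_rate=0.0005, http_5xx_per_min=1)
    print(next_weight(5, good))   # promote to the next stage
    print(next_weight(50, slow))  # roll back
```

A single failing metric is enough to return 0, which matches the "any metric exceeds threshold" rule.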

Horizontal Pod Autoscaling (HPA)

Scaling policy:

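CPU Utilization → target 60%
- If average CPU is above 60%: scale up to the replica count given by the calculation below
- If average CPU is below 60%: scale down the same way (e.g. 5 pods at 30% CPU → 3 pods)
- Min pods: 3, Max pods: 10 (US-East; the standby regions run 1–3 and 0–1 pods)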

Calculation:

desired_replicas = ceil(current_replicas * (current_cpu / target_cpu))
Example: 5 pods at 85% CPU → ceil(5 * (85/60)) = ceil(7.08) = 8 pods → scale to 8
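
The calculation can be checked with a small Python sketch. It assumes the standard Kubernetes HPA behaviour: the result is rounded up, no change is made while the ratio is within a 10% tolerance of 1.0 (the Kubernetes default), and the result is clamped to the min/max bounds. The function name and parameter defaults are illustrative, using the 60% target and the 3 to 10 pod range from the scaling policy.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu: float,
    target_cpu: float = 60.0,
    min_replicas: int = 3,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Replica count an HPA-style autoscaler would aim for."""
    ratio = current_cpu / target_cpu
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: do nothing
    else:
        desired = math.ceil(current_replicas * ratio)  # always round up
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(desired_replicas(5, 85))  # 5 * 85/60 = 7.08, rounded up
    print(desired_replicas(5, 30))  # 5 * 30/60 = 2.5, rounded up
```

With the example figures (5 pods at 85% CPU) the sketch returns 8 rather than 7, because a fractional pod is rounded up so that average utilization ends at or below the target.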

RTO & RPO Targets

RTO (Recovery Time Objective): Time to recover service

flowchart TD
    F1["US-East failure detected"]
    F1 -->|"+5s · health check confirms"| F2["Failover to EU-Central initiated"]
    F2 -->|"+10s · DNS TTL propagation"| F3["Clients rerouted to EU-Central"]
    F3 -->|"+15s · pod startup & readiness"| F4["Service fully restored\nRTO ≈ 30 seconds"]

The timeline is modelled as a sequential chain rather than a parallel diagram because DNS propagation must complete before clients can route, and service startup cannot be declared ready until pods pass readiness probes. The per-step seconds on each edge add up to the 30-second RTO budget (5 + 10 + 15), so each step’s contribution is visible at a glance.
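
The budget arithmetic can be written out as a short Python sketch. The step durations are the ones on the diagram edges; the variable and function names are illustrative only.

```python
from typing import List, Sequence, Tuple

# (step, seconds) pairs taken from the edges of the failover diagram.
RTO_STEPS: Sequence[Tuple[str, int]] = (
    ("health check confirms failure", 5),
    ("DNS TTL propagation", 10),
    ("pod startup and readiness", 15),
)


def cumulative_timeline(steps: Sequence[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return (step, seconds elapsed since the failure) for each step."""
    elapsed = 0
    timeline = []
    for name, seconds in steps:
        elapsed += seconds  # steps are sequential, so durations add up
        timeline.append((name, elapsed))
    return timeline


def total_rto(steps: Sequence[Tuple[str, int]]) -> int:
    """Total recovery time when every step must finish before the next starts."""
    return sum(seconds for _, seconds in steps)


if __name__ == "__main__":
    for name, elapsed in cumulative_timeline(RTO_STEPS):
        print(f"t+{elapsed:>2}s  {name}")
    print("RTO:", total_rto(RTO_STEPS), "seconds")
```

Because the steps are strictly sequential, shortening any one of them (for example a lower DNS TTL) reduces the total by the same amount.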

RPO (Recovery Point Objective): Data loss on failure

flowchart LR
    W["Confirmed writes\n(US-East)"] -->|"async WAL stream"| LAG["Replication lag\n≈ 5 seconds"]
    LAG -->|"applied"| R["EU-Central replica\n(up to date)"]
    LAG -->|"⚠ failure window"| LOST["Up to 5s of writes\nnot yet replicated"]

    style LOST fill:#f87171,color:#fff

Modelling the RPO as a branching path from the lag node makes it clear that the 5-second window is not a fixed loss — it is the maximum exposure. Under normal replication, writes land on the replica well within the lag budget; only if US-East fails precisely while WAL is in-flight does the worst-case loss materialise.
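
To put the exposure window in concrete terms, the worst case is roughly the write rate multiplied by the replication lag. The sketch below uses the lag figures from the region table; the write rate of 200 writes per second is a made-up number for illustration, not an FNP measurement.

```python
# Typical replication lag per standby, in seconds (from the region table).
REPLICATION_LAG_S = {"EU-Central": 5.0, "APAC": 15.0}


def max_writes_at_risk(writes_per_second: float, lag_seconds: float) -> float:
    """Upper bound on committed writes missing from a replica at failover."""
    if writes_per_second < 0 or lag_seconds < 0:
        raise ValueError("rate and lag must be non-negative")
    return writes_per_second * lag_seconds


if __name__ == "__main__":
    rate = 200.0  # hypothetical write rate, not an FNP figure
    for region, lag in REPLICATION_LAG_S.items():
        print(region, max_writes_at_risk(rate, lag))
```

The result is an upper bound: a failure that happens when the replica has nearly caught up loses far fewer writes.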

Improving RPO:

Synchronous replication: RPO = 0 (no data loss), but higher latency
Trade-off: FNP accepts 5-sec RPO for lower write latency

Key Terms

RTO (Recovery Time Objective) → Maximum downtime tolerable
RPO (Recovery Point Objective) → Maximum data loss tolerable
Canary deployment → Gradual rollout starting at 5% traffic
Async replication → Primary confirms the write without waiting for the replica’s ACK; the replica applies it shortly afterwards
HPA → Horizontal Pod Autoscaler; scales based on metrics

Q&A

Q: Why not active-active across all regions? A: Write conflicts across regions are harder to resolve. Active-passive (single writer) is simpler and avoids conflicting writes. Async replication keeps write latency low in exchange for an RPO of about 5 seconds.

Q: What if the canary has a subtle bug that doesn’t affect P99 latency? A: The canary is monitored for 10 minutes at 5% before the 50% rollout, and a longer canary window catches more bugs. FNP uses 15+ metrics (not just latency), including error logs, to detect anomalies.

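Q: How do clients find the right region? A: A global load balancer (Google Cloud Global Load Balancing or AWS CloudFront) is the single entry point. Because the deployment is active-passive, it sends primary traffic to US-East and moves clients to EU-Central, then APAC, only on failover; EU-Central can also serve read-only queries.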

Examples

Multi-region deployment is like a restaurant chain: the primary region is the main restaurant, the secondary is a branch in the next city (can open quickly if needed), and the tertiary is a food truck (slowest to start, but available). Canary deployment is like a pilot program at one store before opening 100 franchises.
