CRUMB a card from devarno-cloud

FNP Cost Optimization & Karpenter

fnp intermediate 6 min read

ELI5

Cloud instances (servers) cost money. Spot instances are like buying airline tickets at the last minute — 80% cheaper, but the airline can cancel anytime. Karpenter is a tool that automatically buys cheap spot instances and replaces them before they’re cancelled, saving FNP 40-50% per month while keeping service reliable.

Technical Deep Dive

Spot vs On-Demand Pricing

Instance  | Cost       | Availability | Use in FNP
On-demand | $2/hour    | Always       | Production NodePool (3-10 pods)
Spot      | $0.40/hour | 95%+ uptime  | Spot NodePool (10-30 pods)
Savings   | 80%        | Trade-off    | 40-50% monthly savings

Monthly cost comparison (treating each pod as one instance at the hourly rates above):

100 pods on-demand: 100 * $2 * 730 hours = $146,000/month
Hybrid (30 on-demand + 70 spot):
  On-demand: 30 * $2 * 730 = $43,800
  Spot: 70 * $0.40 * 730 = $20,440
  Total = $64,240/month
Savings: $146,000 - $64,240 = $81,760/month (56% reduction)
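
The arithmetic in the comparison above can be reproduced with a short script. This is a minimal sketch: the function name `monthly_cost` and its parameters are illustrative, and it keeps the comparison's simplification that every pod occupies one instance billed for 730 hours.

```python
HOURS_PER_MONTH = 730  # same figure the comparison uses


def monthly_cost(on_demand_pods: int, spot_pods: int,
                 on_demand_rate: float = 2.00, spot_rate: float = 0.40) -> float:
    """Monthly cost in dollars, assuming each pod occupies one instance."""
    hourly = on_demand_pods * on_demand_rate + spot_pods * spot_rate
    return hourly * HOURS_PER_MONTH


baseline = monthly_cost(100, 0)   # all on-demand
hybrid = monthly_cost(30, 70)     # 30 on-demand + 70 spot
savings = baseline - hybrid

print(f"baseline ${baseline:,.0f}, hybrid ${hybrid:,.0f}, "
      f"savings ${savings:,.0f} ({savings / baseline:.0%})")
```

Changing the pod split or the hourly rates shows how sensitive the monthly bill is to the on-demand share.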

Karpenter Consolidation

Karpenter's consolidation loop runs continuously. On each pass it does the following:

1. List all pods and nodes
2. Calculate: which pods could fit on fewer nodes?
3. If consolidation is possible:
  - Create new nodes (cheaper)
  - Drain old nodes (graceful pod termination)
  - Delete empty nodes
4. Result: fewer total nodes, lower cost

Consolidation example:

Before (three equally sized nodes):
- Node 1: [pod-a, pod-b] (60% utilized)
- Node 2: [pod-c] (20% utilized)
- Node 3: [pod-d, pod-e] (55% utilized)
After consolidation:
- Node 1: [pod-a, pod-b, pod-c] (80% utilized)
- Node 3: [pod-d, pod-e] (55% utilized)
- Node 2 drained and deleted (pod-c moved to Node 1)
Savings: 3 nodes reduced to 2 (1/3 fewer nodes)
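
The consolidation decision can be sketched as a greedy simulation: try to empty the least-utilized node and keep the change only if all of its pods fit on the remaining nodes. This is an illustration, not Karpenter's actual algorithm. The per-pod sizes below are assumptions chosen to add up to the "Before" utilizations, and all nodes are assumed to be the same size.

```python
NODE_CAPACITY = 100  # percent of one node; assumes identically sized nodes


def consolidate(nodes: dict[str, dict[str, int]]) -> dict[str, dict[str, int]]:
    """Greedily remove nodes whose pods all fit on the remaining nodes.

    `nodes` maps node name -> {pod name: percent of node capacity used}.
    """
    layout = {name: dict(pods) for name, pods in nodes.items()}
    # Try the emptiest nodes first; they are the cheapest to drain.
    for candidate in sorted(layout, key=lambda n: sum(layout[n].values())):
        trial = {n: dict(p) for n, p in layout.items() if n != candidate}
        movable = True
        # Place the largest pods first (first-fit decreasing).
        for pod, size in sorted(layout[candidate].items(), key=lambda kv: -kv[1]):
            target = next((n for n, p in trial.items()
                           if sum(p.values()) + size <= NODE_CAPACITY), None)
            if target is None:
                movable = False
                break
            trial[target][pod] = size
        if movable:
            layout = trial  # commit: the candidate node can be deleted
    return layout


before = {
    "node-1": {"pod-a": 30, "pod-b": 30},  # 60% utilized
    "node-2": {"pod-c": 20},               # 20% utilized
    "node-3": {"pod-d": 30, "pod-e": 25},  # 55% utilized
}
after = consolidate(before)
for name, pods in after.items():
    print(name, sorted(pods), f"{sum(pods.values())}% utilized")
```

With these assumed sizes only node-2 can be emptied, because the three nodes carry 135% of one node's capacity in total and so need at least two nodes.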

Spot Eviction Handling

Scenario: AWS reclaims a spot instance (maintenance or demand surge)

1. AWS sends a 2-minute interruption notice
2. Karpenter receives the notice and cordons the node (no new pods are scheduled onto it)
3. Existing pods are drained gracefully:
  - Each pod receives SIGTERM (30-second grace period)
  - Pods save state to the database
  - Pods terminate
4. Karpenter provisions a replacement node, and the pod's controller (for example, a Deployment) schedules a replacement pod onto it
5. The new pod resumes from the saved state

RTO (recovery): ~30 seconds
Data loss: None (state persisted)
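
The pod-side half of this sequence (receive SIGTERM, save state, terminate, resume from saved state) can be sketched as follows. This is a minimal illustration rather than FNP's implementation: `CheckpointingWorker`, `save_state`, and `install_sigterm_handler` are hypothetical names, and a plain list stands in for the database.

```python
import signal
import threading
from typing import Callable


class CheckpointingWorker:
    """Does step-by-step work and checkpoints progress after every step."""

    def __init__(self, save_state: Callable[[int], None], start_at: int = 0) -> None:
        self._save_state = save_state   # hypothetical persistence hook
        self._stop = threading.Event()
        self.progress = start_at        # resume point from the last checkpoint

    def request_stop(self, signum=None, frame=None) -> None:
        """Signature matches a signal handler; it only sets a flag."""
        self._stop.set()

    def run(self, total_steps: int) -> int:
        while self.progress < total_steps and not self._stop.is_set():
            self.progress += 1               # one unit of work
            self._save_state(self.progress)  # periodic checkpoint
        self._save_state(self.progress)      # final checkpoint before exiting
        return self.progress


def install_sigterm_handler(worker: CheckpointingWorker) -> None:
    """Call from the main thread at startup so SIGTERM triggers a clean stop."""
    signal.signal(signal.SIGTERM, worker.request_stop)


# Simulated eviction: the stop request arrives while step 2 is being saved.
checkpoints: list[int] = []


def save_and_maybe_evict(progress: int) -> None:
    checkpoints.append(progress)
    if progress == 2:
        evicted.request_stop()


evicted = CheckpointingWorker(save_and_maybe_evict)
last = evicted.run(total_steps=5)  # stops early after step 2

replacement = CheckpointingWorker(checkpoints.append, start_at=last)
replacement.run(total_steps=5)     # resumes at step 3
```

The handler only sets a flag, so the loop finishes its current step and writes a final checkpoint. That work has to complete inside the 30-second grace period.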

NodePool Configuration

Production NodePool (on-demand):

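# Assumes Karpenter v1 on AWS with an EC2NodeClass named "on-demand"
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: production
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: on-demand
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: 20
    memory: 100Gi
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: Never  # Never consolidate production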

Spot NodePool (interruptible):

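# Assumes Karpenter v1 on AWS with an EC2NodeClass named "spot"
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: spot
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
  limits:
    cpu: 50
    memory: 200Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # Aggressively consolidate
    consolidateAfter: 30s  # Act on idle or underutilized nodes after 30s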

Key Terms

  • Spot instance → Unused cloud capacity sold at discount; can be reclaimed by cloud provider
  • Karpenter → Kubernetes-native node autoscaler; provisions nodes and bin-packs pods onto them
  • Consolidation → Karpenter combines pods onto fewer nodes, deletes empty nodes
  • Cordoning → Mark node as “no new pods”; existing pods continue running

Q&A

Q: What if a pod is evicted mid-operation? A: FNP persists state to PostgreSQL every 1-2 seconds. Eviction triggers a graceful shutdown: the pod receives SIGTERM and has a 30-second grace period to save its final state. The replacement pod then resumes from the last checkpoint.

Q: Can critical pods run on spot? A: Yes, if they’re stateless or can quickly recover. Health checks + readiness probes ensure bad pods are replaced. Karpenter respects pod disruption budgets (PDB).

Q: What’s the maximum savings? A: 50-60% is realistic with mixed on-demand + spot. 90% spot would be cheaper but riskier (higher eviction rate). FNP targets 40-50% savings with 99.9% availability.
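
The relationship between spot share and savings in the answer above can be made explicit with a small calculation. This is a rough sketch that assumes every pod costs the same and spot is a flat 80% cheaper (the rates from the pricing table); the function names are illustrative, and real bills also depend on instance mix and interruptions.

```python
def blended_savings(spot_share: float, spot_discount: float = 0.80) -> float:
    """Fraction saved versus all on-demand when `spot_share` of pods run on spot."""
    return spot_share * spot_discount


def spot_share_for(target_savings: float, spot_discount: float = 0.80) -> float:
    """Spot share needed to reach `target_savings` under the same assumptions."""
    return target_savings / spot_discount


for share in (0.50, 0.70, 0.90):
    print(f"{share:.0%} spot -> {blended_savings(share):.0%} savings")
```

Under these assumptions a 70% spot share yields the 56% reduction from the hybrid cost comparison, and the 40-50% target corresponds to running roughly 50% to 62.5% of pods on spot.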

Examples

Karpenter is like a warehouse manager: buying cheap containers (spot) when demand is high, continuously consolidating inventory to minimize storage costs, and keeping a buffer of expensive permanent containers (on-demand) for critical stock.