FNP Cost Optimization & Karpenter


ELI5

Cloud instances (servers) cost money. Spot instances are like last-minute airline tickets: up to 80% cheaper, but the airline can cancel at short notice. Karpenter is a tool that automatically provisions cheap spot instances and replaces them when the cloud provider reclaims them, saving FNP 40-50% per month while keeping the service reliable.

Technical Deep Dive

Spot vs On-Demand Pricing

Instance	Cost	Availability	Use in FNP
On-demand	$2/hour	Always	Production NodePool (3-10 pods)
Spot	$0.40/hour	95%+ uptime	Spot NodePool (10-30 pods)
Savings	80%	Trade-off	40-50% monthly savings

Monthly cost comparison (illustrative prices from the table, assuming one pod per instance and 730 hours per month):

100 pods on-demand: 100 * $2 * 730 hours = $146,000/month
Hybrid (30 on-demand + 70 spot):
  30 * $2 * 730 = $43,800
  70 * $0.40 * 730 = $20,440
  Total = $64,240/month
Savings: $146,000 - $64,240 = $81,760/month (56% reduction)
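
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the illustrative $2.00 and $0.40 hourly rates and one pod per instance; the function names are hypothetical and not part of any FNP tooling.

```python
HOURS_PER_MONTH = 730  # same approximation the comparison above uses


def monthly_cost(instances: int, hourly_rate: float) -> float:
    """Cost of running `instances` instances for a full month."""
    return instances * hourly_rate * HOURS_PER_MONTH


def hybrid_savings(on_demand: int, spot: int,
                   on_demand_rate: float = 2.00,
                   spot_rate: float = 0.40) -> dict:
    """Compare an all-on-demand fleet with an on-demand/spot mix."""
    baseline = monthly_cost(on_demand + spot, on_demand_rate)
    hybrid = (monthly_cost(on_demand, on_demand_rate)
              + monthly_cost(spot, spot_rate))
    saved = baseline - hybrid
    return {
        "baseline": round(baseline, 2),
        "hybrid": round(hybrid, 2),
        "saved": round(saved, 2),
        "saved_pct": round(100 * saved / baseline, 1),
    }


result = hybrid_savings(on_demand=30, spot=70)
print(result)
```

Changing the `on_demand` and `spot` arguments shows how quickly the savings percentage moves with the mix, which is useful when real instance prices differ from the round numbers used here.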

Karpenter Consolidation

Karpenter evaluates consolidation continuously as part of its control loop, not on a fixed hourly schedule:

1. List all pods and nodes
2. Calculate: which pods could fit on fewer nodes?
3. If consolidation is possible:
   - Create new nodes (cheaper)
   - Drain old nodes (graceful pod termination)
   - Delete empty nodes
4. Result: fewer total nodes, lower cost

Consolidation example:

Before:
- Node 1: [pod-a, pod-b] (60% utilized)
- Node 2: [pod-c] (20% utilized)
- Node 3: [pod-d, pod-e] (55% utilized)

After consolidation:
- Node 1: [pod-a, pod-b, pod-c] (80% utilized)
- Node 2: drained and deleted (pod-c moved to Node 1)
- Node 3: [pod-d, pod-e] (55% utilized)

Savings: 1/3 fewer nodes
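
The "which pods could fit on fewer nodes?" question is a bin-packing problem. The sketch below uses a first-fit-decreasing heuristic to illustrate the idea. It is not Karpenter's actual algorithm, which also weighs instance prices, scheduling constraints, and disruption budgets, and unlike a real consolidation it reshuffles pods freely. The per-pod sizes are hypothetical values chosen to add up to the node totals in the example (60%, 20%, 55%).

```python
def first_fit_decreasing(pod_sizes: dict[str, int],
                         node_capacity: int = 100) -> list[dict]:
    """Pack pods onto equal-sized nodes with a first-fit-decreasing pass.

    pod_sizes maps pod name -> share of one node's capacity, in percent.
    """
    nodes: list[dict] = []
    # Place the largest pods first; smaller pods then fill the remaining gaps.
    for name, size in sorted(pod_sizes.items(), key=lambda item: -item[1]):
        if size > node_capacity:
            raise ValueError(f"{name} does not fit on any node")
        for node in nodes:
            if node["used"] + size <= node_capacity:
                node["pods"].append(name)
                node["used"] += size
                break
        else:
            # No existing node has room, so open a new one.
            nodes.append({"pods": [name], "used": size})
    return nodes


# Hypothetical sizes: a+b = 60, c = 20, d+e = 55 (the three original nodes).
pods = {"pod-a": 35, "pod-b": 25, "pod-c": 20, "pod-d": 30, "pod-e": 25}
packed = first_fit_decreasing(pods)
for i, node in enumerate(packed, start=1):
    print(f"Node {i}: {node['pods']} ({node['used']}% utilized)")
```

With these sizes the five pods fit on two nodes instead of three, which matches the one-third reduction in the example even though the exact placement differs.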

Spot Eviction Handling

Scenario: AWS reclaims a spot instance because it needs the capacity back

1. AWS sends a 2-minute interruption notice
2. Karpenter receives the notice and cordons the node (no new pods)
3. Existing pods are drained gracefully:
   - Kubernetes sends SIGTERM to each pod (30-second grace period)
   - Pods save state to the database
   - Pods terminate
4. The workload controller (for example, a Deployment) creates replacement pods; Karpenter launches a new node if no existing node has room
5. Each new pod resumes from saved state

RTO (recovery): ~30 seconds when spare capacity already exists; longer if a new node has to boot first
Data loss: None (state persisted)
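
The graceful drain only works if the application handles SIGTERM itself. The following is a minimal sketch of such a handler, not FNP's actual implementation: `save_state` is a hypothetical stand-in that writes JSON to a file where the real service would write to its database, and `run_worker` is an assumed name for the pod's main loop.

```python
import json
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Only set a flag here; the main loop does the real cleanup."""
    shutdown_requested.set()


def save_state(state: dict, path: str) -> None:
    """Hypothetical stand-in for the database write described above."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(state, fh)


def run_worker(path: str, tick_seconds: float = 1.0) -> dict:
    """Do work until shutdown is requested, then persist state once."""
    state = {"processed": 0}
    while not shutdown_requested.is_set():
        state["processed"] += 1  # placeholder for one unit of real work
        # Sleeps for tick_seconds but wakes immediately on shutdown.
        shutdown_requested.wait(tick_seconds)
    save_state(state, path)  # must finish within the grace period
    return state


def main() -> None:
    """Wire the handler up; call this from the container entrypoint."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    run_worker("/tmp/worker-state.json")
```

Keeping the signal handler down to setting a flag avoids doing I/O inside the handler, and waiting on the event instead of calling `time.sleep` lets the loop react to SIGTERM without using up the grace period.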

NodePool Configuration

Production NodePool (on-demand):

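# Assumes Karpenter v1 on AWS and an existing EC2NodeClass named "on-demand"
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: production
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: on-demand
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]  # Only launch on-demand capacity
  limits:
    cpu: 20
    memory: 100Gi
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: Never  # Never consolidate production

Spot NodePool (interruptible):

# Assumes Karpenter v1 on AWS and an existing EC2NodeClass named "spot"
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: spot
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]  # Only launch spot capacity
  limits:
    cpu: 50
    memory: 200Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # Aggressively consolidate
    consolidateAfter: 30s  # Act on idle or underutilized nodes after 30s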

Key Terms

Spot instance → Unused cloud capacity sold at discount; can be reclaimed by cloud provider
Karpenter → Kubernetes-native node autoscaler; provisions right-sized nodes for pending pods and bin-packs workloads onto them
Consolidation → Karpenter combines pods onto fewer nodes, deletes empty nodes
Cordoning → Mark node as “no new pods”; existing pods continue running

Q&A

Q: What if a pod is evicted mid-operation?
A: FNP persists state to PostgreSQL every 1-2 seconds. Eviction triggers a graceful shutdown (SIGTERM with a 30-second grace period). The pod saves its final state, and the replacement pod resumes from the last checkpoint.

Q: Can critical pods run on spot?
A: Yes, if they are stateless or can recover quickly. Liveness and readiness probes take unhealthy pods out of service and restart them, and Karpenter respects PodDisruptionBudgets (PDBs) when it drains nodes.

Q: What’s the maximum savings?
A: 50-60% is realistic with a mix of on-demand and spot. Running 90% on spot would be cheaper but riskier, because more of the fleet is exposed to eviction. FNP targets 40-50% savings with 99.9% availability.
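
The trade-off in the last answer can be made concrete by asking what share of the fleet has to run on spot to reach a given savings target. This is a rough sketch, assuming the illustrative $2.00 and $0.40 hourly rates from the pricing table and identically sized instances; it ignores interruption overhead, and `spot_share_for_target` is a hypothetical helper name.

```python
def spot_share_for_target(target_savings: float,
                          on_demand_rate: float = 2.00,
                          spot_rate: float = 0.40) -> float:
    """Fraction of the fleet (0-1) that must be spot to hit target_savings.

    Savings relative to all-on-demand equal spot_share * discount, so the
    required share is simply target_savings / discount.
    """
    discount = 1 - spot_rate / on_demand_rate  # 0.80 at the example prices
    if not 0 <= target_savings <= discount:
        raise ValueError(f"savings must be between 0 and {discount:.0%}")
    return target_savings / discount


for target in (0.40, 0.50, 0.56):
    share = spot_share_for_target(target)
    print(f"{target:.0%} savings needs about {share:.1%} of instances on spot")
```

Under these assumptions the 40-50% target corresponds to running roughly half to five-eighths of the fleet on spot, while the 30/70 split from the cost comparison yields 56%.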

Examples

Karpenter is like a warehouse manager: renting cheap short-term containers (spot) whenever they are available, continuously consolidating inventory to minimize storage costs, and keeping a buffer of expensive permanent containers (on-demand) for critical stock.