CRUMB a card from devarno-cloud

Deployment Strategies & Rollback

sparki intermediate 5 min read

ELI5

When loco ships code, it picks one of four moving-truck plans: just swap the boxes (direct), set up a second house and forward the address (blue-green), open a side door for 5% of guests first (canary), or replace the furniture room by room (rolling). If the new place catches fire, the rollback flag drives the truck back.

Technical Deep Dive

The strategy, platform, and health-check types are defined in subsystems/loco/types.go (Go, engine-side) and mirrored in services/deploy-loco/src/adapters/ (Rust, worker-side).

Strategies

| Constant | String | Description |
| --- | --- | --- |
| StrategyDirect | direct | Stop old, start new. Cheapest, has downtime. |
| StrategyBlueGreen | blue-green | Stand up parallel stack, flip router. |
| StrategyCanary | canary | Route a percentage of traffic, ramp on success. |
| StrategyRolling | rolling | Replace instances N at a time. |
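
As a rough illustration, the four strategy constants could be modelled as a string-backed Go type. The constant names and string values come from the list above; the `Strategy` type name and the `ParseStrategy` and `NeedsRouter` helpers are hypothetical and are not confirmed to exist in subsystems/loco/types.go.

```go
package main

import "fmt"

// Strategy is a hypothetical type name; the card lists only the
// constants and their string values.
type Strategy string

const (
	StrategyDirect    Strategy = "direct"
	StrategyBlueGreen Strategy = "blue-green"
	StrategyCanary    Strategy = "canary"
	StrategyRolling   Strategy = "rolling"
)

// ParseStrategy (hypothetical helper) rejects strings that are not one
// of the four known strategies.
func ParseStrategy(s string) (Strategy, error) {
	switch st := Strategy(s); st {
	case StrategyDirect, StrategyBlueGreen, StrategyCanary, StrategyRolling:
		return st, nil
	default:
		return "", fmt.Errorf("unknown deployment strategy %q", s)
	}
}

// NeedsRouter (hypothetical helper) reports whether the strategy assumes
// a router that can shift traffic between instances.
func (s Strategy) NeedsRouter() bool {
	switch s {
	case StrategyBlueGreen, StrategyCanary, StrategyRolling:
		return true
	default:
		return false
	}
}

func main() {
	for _, raw := range []string{"direct", "blue-green", "canary", "rolling", "bluegreen"} {
		st, err := ParseStrategy(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s needs router: %t\n", st, st.NeedsRouter())
	}
}
```

Validating the string at the boundary keeps a typo such as `bluegreen` from silently falling through to a default strategy.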

Platforms

| Constant | String | Adapter |
| --- | --- | --- |
| PlatformRailway | railway | services/deploy-loco/src/adapters/railway.rs |
| PlatformRender | render | (planned) |
| PlatformFlyIO | flyio | (planned) |
| PlatformVercel | vercel | (planned) |
| PlatformCustom | custom | Generic webhook + script adapter |

Health Checks

| Type | Use |
| --- | --- |
| HealthCheckHTTP | GET an endpoint, expect 2xx |
| HealthCheckTCP | open a TCP port |
| HealthCheckScript | run a script, exit code 0 = pass |

Statuses: passing, warning, critical, unknown.

Decision Flow

flowchart TD
REQ[CreateDeploymentRequest] --> CFG{strategy?}
CFG -->|direct| D[stop old, start new]
CFG -->|blue-green| BG[provision green, run health checks, flip]
CFG -->|canary| CN[shift X% traffic, observe, ramp]
CFG -->|rolling| RL[replace instance batches]
D --> HC[health check]
BG --> HC
CN --> HC
RL --> HC
HC -->|passing| OK[status=success api=healthy]
HC -->|critical| FAIL[status=failed]
FAIL --> AR{auto_rollback?}
AR -->|true| RB[restore rollback_target_id]
AR -->|false| END[stay failed]
RB --> RBOK[status=rolled_back]
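
The flow above can be sketched as a single function. This is an illustration under stated assumptions, not loco's actual code: the `Deployer` interface, `Request` struct, and `Run` function are hypothetical names. The diagram draws only the passing and critical branches, so treating every non-passing status as a failure is an assumption, as is failing with an error when no rollback target exists.

```go
package main

import (
	"errors"
	"fmt"
)

// Deployer is a hypothetical stand-in for a platform adapter.
type Deployer interface {
	Rollout(strategy string) error     // strategy-specific steps
	HealthStatus() string              // "passing", "warning", "critical", "unknown"
	Restore(deploymentID string) error // re-deploy a previous deployment
}

// Request carries the fields the decision flow branches on.
type Request struct {
	Strategy         string
	AutoRollback     bool
	RollbackTargetID string // empty means no target is known
}

// Run returns the final deployment status: success, failed, or rolled_back.
func Run(d Deployer, req Request) (string, error) {
	switch req.Strategy {
	case "direct", "blue-green", "canary", "rolling":
	default:
		return "failed", fmt.Errorf("unknown strategy %q", req.Strategy)
	}
	if err := d.Rollout(req.Strategy); err != nil {
		return "failed", err
	}
	if d.HealthStatus() == "passing" {
		return "success", nil
	}
	// Assumption: anything other than passing is treated as a failure.
	if !req.AutoRollback {
		return "failed", nil
	}
	if req.RollbackTargetID == "" {
		return "failed", errors.New("auto_rollback is set but there is no rollback target")
	}
	if err := d.Restore(req.RollbackTargetID); err != nil {
		return "failed", err
	}
	return "rolled_back", nil
}

// fakeDeployer is an in-memory Deployer used only for the demo.
type fakeDeployer struct {
	health   string
	restored string
}

func (f *fakeDeployer) Rollout(string) error    { return nil }
func (f *fakeDeployer) HealthStatus() string    { return f.health }
func (f *fakeDeployer) Restore(id string) error { f.restored = id; return nil }

func main() {
	d := &fakeDeployer{health: "critical"}
	status, err := Run(d, Request{Strategy: "canary", AutoRollback: true, RollbackTargetID: "dep-41"})
	fmt.Println(status, err, d.restored)
}
```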

State of an Auto-Rollback

stateDiagram-v2
deploying --> health_check
health_check --> success: probes green
health_check --> failed: probes red
failed --> rolled_back: auto_rollback=true
rolled_back --> [*]
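
The same states can be written as a transition table so that illegal jumps, such as going straight from deploying to rolled_back, are rejected. The state names come from the diagram above; the `State` type, the `transitions` map, and `CanTransition` are hypothetical.

```go
package main

import "fmt"

// State is a hypothetical type for the deployment states in the diagram.
type State string

const (
	Deploying   State = "deploying"
	HealthCheck State = "health_check"
	Success     State = "success"
	Failed      State = "failed"
	RolledBack  State = "rolled_back"
)

// transitions lists the edges drawn in the state diagram.
// failed -> rolled_back is only taken when auto_rollback=true;
// that condition is enforced by the caller, not by this table.
var transitions = map[State][]State{
	Deploying:   {HealthCheck},
	HealthCheck: {Success, Failed},
	Failed:      {RolledBack},
}

// CanTransition reports whether the diagram contains an edge from -> to.
func CanTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(HealthCheck, Failed))
	fmt.Println(CanTransition(Deploying, RolledBack))
}
```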

auto_rollback

The Go field DeploymentConfig.AutoRollback (a bool) governs whether a failed health check triggers the rollback path. The Rust worker reads this flag from the deployment row, looks up the previous successful deployment, and re-issues the adapter call for that deployment. The failed row’s rollback_target_id is set to the ID of that previous deployment.

Key Terms

  • canary → a controlled minority traffic shift used to detect regressions before full rollout
  • adapter → per-platform module under deploy-loco/src/adapters/ translating a DeploymentConfig to the platform’s API
  • health check → HTTP/TCP/script probe gating success
  • auto_rollback → boolean on the deployment config; on failure, restore the previous successful deployment

Q&A

Q: Which strategies require a load balancer in front of the service? A: blue-green, canary, and rolling all assume a router that can shift traffic between instances. direct does not; it accepts downtime instead.

Q: Does loco itself implement canary traffic shifting? A: No. Loco delegates to the platform adapter (e.g., Railway’s deploy API). Loco orchestrates the phases (current_phase) and reads health-check results; the platform owns the actual traffic split.

Q: What if rollback_target_id is null when auto-rollback fires? A: There is no previous successful deployment to restore (e.g., this is the first deploy). The worker leaves status at failed and surfaces an error rather than rolling back to nothing.

Examples

A canary to Railway: deploy-loco creates a new Railway deploy at 5% traffic, polls the platform for current_phase=canary, and runs a 60s HTTP health check on the new pods. If the probes pass, it ramps to 50%, re-checks, ramps to 100%, and marks the deployment success/healthy. A 5xx spike during the 50% phase flips status to failed; if auto_rollback=true, Railway is asked to restore the prior deployment ID and the row records rolled_back.
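
The ramp in this example can be sketched as a loop over traffic stages. The 5/50/100 stages come from the example above; `RampCanary` and its callback parameters are hypothetical. In practice the shift and health callbacks would call the platform adapter, since loco delegates the actual traffic split to the platform.

```go
package main

import "fmt"

// RampCanary walks traffic through the given percentages in order. After
// each shift it asks healthy whether probes passed at that level. It
// returns the final status and, on failure, the stage that failed.
func RampCanary(stages []int, shift func(pct int) error, healthy func(pct int) bool) (string, int, error) {
	for _, pct := range stages {
		if err := shift(pct); err != nil {
			return "failed", pct, err
		}
		if !healthy(pct) {
			return "failed", pct, nil
		}
	}
	return "success", 0, nil
}

func main() {
	stages := []int{5, 50, 100}

	shift := func(pct int) error {
		fmt.Printf("shifting %d%% of traffic to the new deploy\n", pct)
		return nil
	}
	// Simulate the 5xx spike at the 50% phase.
	healthy := func(pct int) bool { return pct != 50 }

	status, failedAt, err := RampCanary(stages, shift, healthy)
	fmt.Println(status, failedAt, err)
}
```

Because the loop stops at the first unhealthy stage, the 100% shift is never attempted once the 50% phase fails.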
