CRUMB a card from devarno-cloud

Site Provisioning Saga State Machine

choco advanced 6 min read

ELI5

A saga is a checklist for building a user’s documentation site. Every box ticked is one row update. If the worker dies between boxes, NATS redelivers the job and the checklist tells the next worker which box to start from instead of starting over.

Technical Deep Dive

Implementation: services/choco-gateway/internal/saga/provisioning.go. State strings mirror migration 000012’s CHECK constraint and the proto enum used in SiteProvisionStateChanged (proto/events/onboarding/v1/lifecycle.proto:178-186).

State Diagram

stateDiagram-v2
    [*] --> requested
    requested --> source_resolving
    source_resolving --> source_resolved
    source_resolving --> awaiting_github : no GitHub link
    awaiting_github --> source_resolving : Start() re-entry
    source_resolved --> vercel_creating
    vercel_creating --> vercel_created
    vercel_created --> hook_creating
    hook_creating --> hook_created
    hook_created --> live
    requested --> failed : Fail()
    source_resolving --> failed : Fail()
    vercel_creating --> failed : Fail()
    hook_creating --> failed : Fail()
    live --> [*]
    failed --> [*]
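The diagram can be read as an adjacency table. The following is a minimal sketch of that table in Go, assuming hypothetical names (State, edges, CanTransition) that are not taken from provisioning.go. It encodes only the arrows drawn above; the SQL in Transition() guards terminal states only, so this is an aid for reading the diagram rather than a description of what the code enforces.

```go
package main

import "fmt"

// State is a saga state string; the values mirror the diagram.
type State string

const (
	Requested       State = "requested"
	SourceResolving State = "source_resolving"
	SourceResolved  State = "source_resolved"
	AwaitingGitHub  State = "awaiting_github"
	VercelCreating  State = "vercel_creating"
	VercelCreated   State = "vercel_created"
	HookCreating    State = "hook_creating"
	HookCreated     State = "hook_created"
	Live            State = "live"
	Failed          State = "failed"
)

// edges lists only the arrows drawn in the diagram (hypothetical table).
// Terminal states have no entry, so nothing leaves them.
var edges = map[State][]State{
	Requested:       {SourceResolving, Failed},
	SourceResolving: {SourceResolved, AwaitingGitHub, Failed},
	AwaitingGitHub:  {SourceResolving},
	SourceResolved:  {VercelCreating},
	VercelCreating:  {VercelCreated, Failed},
	VercelCreated:   {HookCreating},
	HookCreating:    {HookCreated, Failed},
	HookCreated:     {Live},
}

// CanTransition reports whether the diagram draws an arrow from -> to.
func CanTransition(from, to State) bool {
	for _, next := range edges[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Requested, SourceResolving))
	fmt.Println(CanTransition(AwaitingGitHub, SourceResolving))
	fmt.Println(CanTransition(Live, Requested))
}
```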

Terminal vs Non-Terminal

IsTerminal() returns true only for live and failed (provisioning.go:44-46). awaiting_github is not terminal — a subsequent Start() after the user links GitHub re-enters the state machine.
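A minimal sketch of that rule, assuming a plain string state and a hypothetical lower-case helper name (the real signature in provisioning.go is not shown here):

```go
package main

import "fmt"

// isTerminal mirrors the rule described above: only live and failed end the saga.
// awaiting_github is a pause state and falls through to false.
func isTerminal(state string) bool {
	switch state {
	case "live", "failed":
		return true
	default:
		return false
	}
}

func main() {
	for _, s := range []string{"live", "failed", "awaiting_github"} {
		fmt.Printf("%s terminal=%v\n", s, isTerminal(s))
	}
}
```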

Atomic Transition Pattern

Transition() (provisioning.go:134-170) uses a CTE to capture the pre-update state alongside the write:

WITH prev AS (
    SELECT state AS prev_state FROM provisioning_sagas WHERE site_id = $1
),
upd AS (
    UPDATE provisioning_sagas
    SET state = $2::text, updated_at = NOW(),
        last_error = CASE WHEN $2::text = 'failed' THEN last_error ELSE NULL END,
        error_kind = CASE WHEN $2::text = 'failed' THEN error_kind ELSE NULL END
    WHERE site_id = $1 AND state NOT IN ('live', 'failed')
    RETURNING attempt, updated_at
)
SELECT prev.prev_state, upd.attempt, upd.updated_at FROM prev, upd

Two non-obvious bits:

  • The WHERE state NOT IN ('live','failed') clause silently drops transitions out of terminal states (returns pgx.ErrNoRows), which the wrapper translates to “no non-terminal row to transition” (provisioning.go:158). This blocks zombie writes from a Vercel webhook arriving after a manual fail.
  • The explicit $2::text cast exists because pgx cannot infer the parameter type when the same parameter appears in both a SET clause and a CASE WHEN $2::text = 'failed' predicate; without the cast it would default to unknown and fail with SQLSTATE 42P08 (provisioning.go:144-146).

Attempt Counter

Start() does an INSERT ... ON CONFLICT (site_id) DO UPDATE SET attempt = attempt + 1 (provisioning.go:101-107). Retries are idempotent on state but increment attempt — stuck-saga alerting reasons about retry count, not state churn.

Correlation Preservation

On ON CONFLICT, correlation_id = COALESCE(provisioning_sagas.correlation_id, EXCLUDED.correlation_id) — the first correlation_id captured wins, so retries link traces back to the original attempt rather than fragmenting per redelivery.
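The two upsert behaviours (attempt increments, first correlation_id wins) can be modelled in memory. This is an illustration only: store, sagaRow, and start are hypothetical names, the map stands in for the provisioning_sagas table, and the initial attempt value of 1 is an assumption the source does not state. State handling on conflict is left out because the source does not spell it out.

```go
package main

import "fmt"

// sagaRow holds only the two columns discussed above.
type sagaRow struct {
	Attempt       int
	CorrelationID *string // nil models SQL NULL
}

// store is an in-memory stand-in for the provisioning_sagas table.
type store map[string]*sagaRow

// start models INSERT ... ON CONFLICT (site_id) DO UPDATE for these columns:
// attempt is incremented and the stored correlation_id wins unless it is NULL.
func (s store) start(siteID, correlationID string) *sagaRow {
	row, ok := s[siteID]
	if !ok {
		// Assumption: a fresh row starts at attempt 1.
		row = &sagaRow{Attempt: 1, CorrelationID: &correlationID}
		s[siteID] = row
		return row
	}
	row.Attempt++
	if row.CorrelationID == nil { // COALESCE(existing, EXCLUDED)
		row.CorrelationID = &correlationID
	}
	return row
}

func main() {
	s := store{}
	s.start("site-1", "corr-a")
	row := s.start("site-1", "corr-b") // redelivery carrying a new correlation id
	fmt.Println(row.Attempt, *row.CorrelationID)
}
```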

Error Categorisation

error_kind ∈ {github_api, vercel_api, no_github_token, quota_exceeded, internal} (provisioning.go:50-57), 1:1 with the proto ProvisionErrorKind enum (lifecycle.proto:148-155). last_error is free-form; error_kind is the categorical alerting key.
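As a rough sketch, the five kinds could be represented as a typed string with a validity check. The string values come from the list above; the Go identifiers (ErrorKind, Valid, and the constant names) are hypothetical and may differ from provisioning.go.

```go
package main

import "fmt"

// ErrorKind is the categorical alerting key; last_error stays free-form.
type ErrorKind string

const (
	KindGitHubAPI     ErrorKind = "github_api"
	KindVercelAPI     ErrorKind = "vercel_api"
	KindNoGitHubToken ErrorKind = "no_github_token"
	KindQuotaExceeded ErrorKind = "quota_exceeded"
	KindInternal      ErrorKind = "internal"
)

// Valid reports whether k is one of the five known kinds.
func (k ErrorKind) Valid() bool {
	switch k {
	case KindGitHubAPI, KindVercelAPI, KindNoGitHubToken, KindQuotaExceeded, KindInternal:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(ErrorKind("vercel_api").Valid())
	fmt.Println(ErrorKind("timeout").Valid())
}
```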

Key Terms

  • saga → multi-step workflow with explicit state persistence; each step is independently retryable.
  • terminal state → live or failed; further transitions are rejected because the update matches no row.
  • awaiting_github → distinguished pause state; not terminal, exits when the user links GitHub and a new Start() is called.

Q&A

Q: What happens if a Vercel webhook fires after the saga has been manually failed? A: Transition()’s WHERE clause excludes terminal states; the update affects zero rows and Scan returns pgx.ErrNoRows, surfaced as “no non-terminal row to transition” (provisioning.go:157-159).

Q: Why is last_error cleared on every non-failed transition? A: The CASE WHEN $2::text = 'failed' clause keeps the error fields populated only when transitioning to failed; on any forward transition they are nulled (provisioning.go:148-149) so a recovered saga doesn’t carry a stale error message.

Q: Why is the orchestrator extraction-ready? A: Package-level docs say it depends only on pgxpool (provisioning.go:13-14); when choco-forge is split out the consumer calls Transition() from its own NATS handler unchanged.

Examples

The happy path emits roughly six SiteProvisionStateChanged events as the saga walks requested → source_resolving → source_resolved → vercel_creating → vercel_created → hook_creating → hook_created → live (lifecycle.proto:174-176 notes “~6 transitions per happy-path provision”). Subscribers that care about only a few states filter by to_state to avoid the high-frequency churn.
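A subscriber-side filter might look like the sketch below. The stateChanged struct is a hypothetical stand-in for the decoded SiteProvisionStateChanged message: only to_state is named in the source, and the SiteID field and the wanted helper are assumptions for illustration.

```go
package main

import "fmt"

// stateChanged carries just the fields this sketch needs; the real message
// may have more, and only to_state is confirmed by the text above.
type stateChanged struct {
	SiteID  string
	ToState string
}

// wanted returns a predicate that keeps only events whose to_state is in the set.
func wanted(states ...string) func(stateChanged) bool {
	set := make(map[string]struct{}, len(states))
	for _, s := range states {
		set[s] = struct{}{}
	}
	return func(ev stateChanged) bool {
		_, ok := set[ev.ToState]
		return ok
	}
}

func main() {
	keep := wanted("live", "failed") // a subscriber interested in outcomes only
	events := []stateChanged{
		{SiteID: "site-1", ToState: "source_resolving"},
		{SiteID: "site-1", ToState: "vercel_created"},
		{SiteID: "site-1", ToState: "live"},
	}
	for _, ev := range events {
		if keep(ev) {
			fmt.Println(ev.SiteID, ev.ToState)
		}
	}
}
```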
