CRUMB a card from devarno-cloud

Build Job Worker Lifecycle

sparki intermediate 6 min read

ELI5

A worker is a chef on shift. It waits at the pass for the next ticket (build job), starts a 30-minute timer, cooks every step in order, and writes “success” or “failed” on the receipt. If anyone yells “kitchen closed!” (context cancelled), the chef stops cleanly.

Technical Deep Dive

internal/executor/worker.go defines Worker and its Start, processJob, executePipeline, and executeStep methods. The executor manager spawns the workers, and each one reads from a shared chan BuildJob.

Lifecycle

sequenceDiagram
    autonumber
    participant Q as jobQueue chan BuildJob
    participant W as Worker
    participant DB as repository.postgres
    participant E as executor
    participant D as docker.go
    loop until ctx.Done or queue closed
        Q-->>W: BuildJob
        W->>DB: recordBuildStart(buildID)
        W->>W: ctx, cancel := WithTimeout(ctx, 30*time.Minute)
        loop for each step in PipelineConfig.Steps
            W->>D: docker run command
            D-->>W: stdout/stderr
            W->>DB: append build log line
            alt step error
                W->>DB: recordBuildComplete(failed, err)
                W-->>Q: next iteration
            end
        end
        W->>DB: recordBuildComplete(success, "")
    end
    Note over W: ctx.Done -> log "worker stopping" and return

State View of One Job

stateDiagram-v2
    [*] --> Received: BuildJob from chan
    Received --> Started: recordBuildStart
    Started --> Running: ctx WithTimeout 30m
    Running --> Running: next step
    Running --> Failed: step error
    Running --> Succeeded: all steps ok
    Running --> Failed: ctx deadline exceeded
    Failed --> [*]: recordBuildComplete failed
    Succeeded --> [*]: recordBuildComplete success

Hard Numbers

  • Context timeout per build: 30 minutes (context.WithTimeout(ctx, 30*time.Minute) in processJob).
  • Status strings written: "success" or "failed" (lowercase, plain).
  • Worker shutdown signal: ctx.Done() (graceful) or a closed jobQueue channel.
  • Logging: zap.Logger, with worker_id field bound at construction.

Failure Semantics

executePipeline runs steps in order and stops at the first failure, returning fmt.Errorf("step '%s' failed: %w", step.Name, err). That error's message becomes the errorMsg written to the DB; downstream subsystems (score, websocket hub) read it from the row.

Key Terms

  • BuildJob → struct passed over jobQueue; carries BuildID, ProjectID, PipelineConfig
  • build context → 30-minute deadline created per-job, separate from worker lifetime
  • recordBuildStart / recordBuildComplete → executor methods that mutate the DB row
  • logBuild → executor helper that appends one line to the build log stream

Q&A

Q: Can a build run longer than 30 minutes? A: No. The context is created with WithTimeout(ctx, 30*time.Minute) per-job. When that fires, the running step’s exec is cancelled and the build is marked failed. There is no per-step extension of this deadline.

Q: What happens when jobQueue is closed? A: The receive in the select returns ok == false, and the worker logs “job queue closed, worker stopping” and returns. A job that has already been received runs to completion. Jobs still buffered in the channel are not lost either: Go delivers every buffered value from a closed channel before a receive reports ok == false.

Q: Is the recordBuildStart failure fatal? A: Effectively yes, for that job: the worker logs the error and returns without executing the pipeline, so no build runs and no completion is recorded.

Q: How are step retries from the YAML applied? A: The schema permits retry: 0..5, but the worker.executeStep path returns on the first error; per-step retry logic lives downstream in the runner layer (subsystems/run), not in worker.go itself.

Examples

Worker pool of 4: executor.Manager spawns 4 goroutines, each running Worker.Start(ctx, jobQueue). The RabbitMQ consumer (see sparki-005) pulls a BuildJobMessage off builds; the executor wraps it as a BuildJob and writes it to jobQueue. The first idle worker takes it.
