Build Job Worker Lifecycle
sparki intermediate 6 min read
ELI5
A worker is a chef on shift. It waits at the pass for the next ticket (build job), starts a 30-minute timer, cooks every step in order, and writes “success” or “failed” on the receipt. If anyone yells “kitchen closed!” (context cancelled), the chef stops cleanly.
Technical Deep Dive
internal/executor/worker.go defines Worker and its Start, processJob, executePipeline, and executeStep methods. The executor manager spawns the workers, and each one reads from a shared chan BuildJob.
Lifecycle
sequenceDiagram
    autonumber
    participant Q as jobQueue chan BuildJob
    participant W as Worker
    participant DB as repository.postgres
    participant E as executor
    participant D as docker.go
    loop until ctx.Done or queue closed
        Q-->>W: BuildJob
        W->>DB: recordBuildStart(buildID)
        W->>W: ctx, cancel := WithTimeout(ctx, 30*time.Minute)
        loop for each step in PipelineConfig.Steps
            W->>D: docker run command
            D-->>W: stdout/stderr
            W->>DB: append build log line
            alt step error
                W->>DB: recordBuildComplete(failed, err)
                W-->>Q: next iteration
            end
        end
        W->>DB: recordBuildComplete(success, "")
    end
    Note over W: ctx.Done -> log "worker stopping" and return
State View of One Job
stateDiagram-v2
    [*] --> Received: BuildJob from chan
    Received --> Started: recordBuildStart
    Started --> Running: ctx WithTimeout 30m
    Running --> Running: next step
    Running --> Failed: step error
    Running --> Succeeded: all steps ok
    Running --> Failed: ctx deadline exceeded
    Failed --> [*]: recordBuildComplete failed
    Succeeded --> [*]: recordBuildComplete success
Hard Numbers
- Context timeout per build: 30 minutes (context.WithTimeout(ctx, 30*time.Minute) in processJob).
- Status strings written: "success" or "failed" (lowercase, plain).
- Worker shutdown signal: ctx.Done() (graceful) OR closed jobQueue channel.
- Logging: zap.Logger, with worker_id field bound at construction.
Failure Semantics
executePipeline runs steps in order; the first step error returns fmt.Errorf("step '%s' failed: %w", step.Name, err). The error message becomes the errorMsg written to the DB; downstream subsystems (score, websocket hub) read the row.
Key Terms
- BuildJob → struct passed over
jobQueue; carriesBuildID,ProjectID,PipelineConfig - build context → 30-minute deadline created per-job, separate from worker lifetime
- recordBuildStart / recordBuildComplete → executor methods that mutate the DB row
- logBuild → executor helper that appends one line to the build log stream
Q&A
Q: Can a build run longer than 30 minutes?
A: No. The context is created with WithTimeout(ctx, 30*time.Minute) per-job. When that fires, the running step’s exec is cancelled and the build is marked failed. There is no per-step extension of this deadline.
Q: What happens when jobQueue is closed?
A: The select sees ok == false, the worker logs “job queue closed, worker stopping” and returns. Jobs already received run to completion. Nothing buffered is lost either: a closed Go channel still delivers its buffered values, and a receive reports ok == false only once the buffer is empty.
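The guarantee behind this answer is a property of Go channels: values buffered before close are still delivered, and ok turns false only after the buffer is empty. A minimal demonstration, using a hypothetical drain function and plain strings in place of BuildJob values:

```go
package main

import "fmt"

// drain receives until the channel reports closed-and-empty,
// returning everything it got along the way.
func drain(queue <-chan string) (received []string) {
	for {
		job, ok := <-queue
		if !ok {
			return received
		}
		received = append(received, job)
	}
}

func main() {
	queue := make(chan string, 3)
	queue <- "b-1"
	queue <- "b-2"
	close(queue) // closed while two jobs are still buffered

	fmt.Println(drain(queue)) // [b-1 b-2]
}
```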
Q: Is the recordBuildStart failure fatal?
A: Effectively yes, for that job. The worker logs the error and returns without executing the pipeline, so no build runs and no completion row is written.
Q: How are step retries from the YAML applied?
A: The schema permits retry: 0..5, but the worker.executeStep path returns on the first error; per-step retry logic lives downstream in the runner layer (subsystems/run), not in worker.go itself.
Examples
Worker pool of 4: executor.Manager spawns 4 goroutines, each running Worker.Start(ctx, jobQueue). The RabbitMQ consumer (see sparki-005) pulls a BuildJobMessage off the builds queue; the executor wraps it as a BuildJob and writes it to jobQueue. The first idle worker takes it.