Build Job Worker Lifecycle

sparki intermediate 6 min read

ELI5

A worker is a chef on shift. It waits at the pass for the next ticket (build job), starts a 30-minute timer, cooks every step in order, and writes “success” or “failed” on the receipt. If anyone yells “kitchen closed!” (context cancelled), the chef stops cleanly.

Technical Deep Dive

internal/executor/worker.go defines Worker and its Start, processJob, executePipeline, and executeStep methods. The executor manager spawns the workers, and all of them read from a shared chan BuildJob.

Lifecycle

sequenceDiagram
  autonumber
  participant Q as jobQueue chan BuildJob
  participant W as Worker
  participant DB as repository.postgres
  participant E as executor
  participant D as docker.go

  loop until ctx.Done or queue closed
    Q-->>W: BuildJob
    W->>DB: recordBuildStart(buildID)
    W->>W: ctx, cancel := WithTimeout(ctx, 30*time.Minute)
    loop for each step in PipelineConfig.Steps
      W->>D: docker run command
      D-->>W: stdout/stderr
      W->>DB: append build log line
      alt step error
        W->>DB: recordBuildComplete(failed, err)
        W-->>Q: next iteration
      end
    end
    W->>DB: recordBuildComplete(success, "")
  end
  Note over W: ctx.Done -> log "worker stopping" and return

State View of One Job

stateDiagram-v2
  [*] --> Received: BuildJob from chan
  Received --> Started: recordBuildStart
  Started --> Running: ctx WithTimeout 30m
  Running --> Running: next step
  Running --> Failed: step error
  Running --> Succeeded: all steps ok
  Running --> Failed: ctx deadline exceeded
  Failed --> [*]: recordBuildComplete failed
  Succeeded --> [*]: recordBuildComplete success

Hard Numbers

Context timeout per build: 30 minutes (context.WithTimeout(ctx, 30*time.Minute) in processJob).
Status strings written: "success" or "failed" (lowercase, plain).
Worker shutdown signal: ctx.Done() (graceful) OR closed jobQueue channel.
Logging: zap.Logger, with worker_id field bound at construction.

Failure Semantics

executePipeline runs steps in order; the first step error returns fmt.Errorf("step '%s' failed: %w", step.Name, err). The error message becomes the errorMsg written to the DB; downstream subsystems (score, websocket hub) read the row.

Key Terms

BuildJob → struct passed over jobQueue; carries BuildID, ProjectID, PipelineConfig
build context → 30-minute deadline created per-job, separate from worker lifetime
recordBuildStart / recordBuildComplete → executor methods that mutate the DB row
logBuild → executor helper that appends one line to the build log stream

Q&A

Q: Can a build run longer than 30 minutes? A: No. The context is created with WithTimeout(ctx, 30*time.Minute) per-job. When that fires, the running step’s exec is cancelled and the build is marked failed. There is no per-step extension of this deadline.

Q: What happens when jobQueue is closed? A: The select sees ok == false, the worker logs “job queue closed, worker stopping” and returns. A job that was already received completes normally. Jobs still buffered in the channel at close time are not dropped either: a receive on a closed Go channel reports ok == false only after the buffer has been drained.
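
The close semantics can be checked with a few lines of standard Go. The drain helper and the buffer size of 2 are illustrative assumptions; whether the real jobQueue is buffered is not stated here.

```go
package main

import "fmt"

// BuildJob is trimmed to one field for the sketch.
type BuildJob struct{ BuildID string }

// drain receives until the channel reports closed and returns the IDs seen.
func drain(q <-chan BuildJob) []string {
	var ids []string
	for {
		job, ok := <-q
		if !ok {
			return ids // closed and empty
		}
		ids = append(ids, job.BuildID)
	}
}

func main() {
	q := make(chan BuildJob, 2)
	q <- BuildJob{BuildID: "b-1"}
	q <- BuildJob{BuildID: "b-2"}
	close(q) // closing does not discard the two buffered jobs

	fmt.Println(drain(q)) // [b-1 b-2]
}
```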

Q: Is a recordBuildStart failure fatal? A: Effectively yes, for that job. The worker logs the error and returns without executing the pipeline, so no build runs and no completion row is written.

Q: How are step retries from the YAML applied? A: The schema permits retry: 0..5, but the worker.executeStep path returns on the first error; per-step retry logic lives downstream in the runner layer (subsystems/run), not in worker.go itself.

Examples

Worker pool of 4: executor.Manager spawns 4 goroutines, each running Worker.Start(ctx, jobQueue). The RabbitMQ consumer (see sparki-005) pulls a BuildJobMessage off builds; the executor wraps it as a BuildJob and writes it to jobQueue. The first idle worker takes it.
