Fleet-Snapshot Workflow to CASA Dashboard

petrova intermediate 6 min read

ELI5

Once a day a robot reads every governed repo, builds a single JSON status report, and drops it on hubble’s doorstep. The CASA dashboard reads from that doorstep. If the doorstep is stale, the dashboard is stale.

Technical Deep Dive

The daily path

sequenceDiagram
  autonumber
  participant Cron as GitHub Actions cron (07:00 UTC)
  participant CO as actions/checkout (no submodules)
  participant CLI as cli/ (npm ci + build)
  participant Doctor as node dist/index.js doctor --commit-state
  participant State as state/<slug>.yaml
  participant Dash as node dist/index.js dashboard --remote
  participant Art as upload-artifact (snapshot.json)
  participant Hubble as POST hubble/api/petrova/snapshot
  participant CASA as CASA Petrova Fleet zone

  Cron->>CO: trigger (or workflow_dispatch)
  CO->>CO: checkout parent only — submodules are private
  CO->>CLI: cd cli, npm ci, npm run build
  CLI->>Doctor: doctor --repo .. --state-dir ../state --commit-state
  Doctor->>State: refresh state/<slug>.yaml entries
  Note over Doctor: non-zero exit logs ::warning:: but does not fail the job
  CLI->>Dash: dashboard --remote --format json > snapshot.json
  Note over Dash: GITHUB_TOKEN raises rate limit 60→5000 req/h
  Dash->>Art: upload as fleet-snapshot artefact, retain 14 days
  Dash->>Hubble: curl -X POST -H "authorization: Bearer $HUBBLE_INGEST_TOKEN"
  alt 2xx
    Hubble-->>CASA: snapshot live, dashboard refreshes
  else non-2xx
    Hubble-->>Dash: ::error::hubble ingest returned HTTP $status (job fails)
  end

Trigger surface (`.github/workflows/fleet-snapshot.yml`)

on:
  schedule:
    - cron: '0 7 * * *'   # 07:00 UTC daily
  workflow_dispatch:

permissions:
  contents: read

concurrency:
  group: fleet-snapshot
  cancel-in-progress: false

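cancel-in-progress: false is deliberate: two snapshot runs racing would corrupt the state-commit window, so a new run waits for the in-flight run to finish instead of cancelling it. GitHub Actions holds at most one pending run per concurrency group, so a newer queued run replaces an older queued one.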

Why submodules are skipped at checkout

- uses: actions/checkout@v4
  # Submodules (core/prompts, core/templates) are private and not
  # accessible via the workflow's default GITHUB_TOKEN.

The workflow needs only registry.yaml, state/, and cli/, all of which live in the parent repo. Fetching the submodules would fail because they are private, and that failure would stop the job before the snapshot ran, so checkout skips them. (Commit 5a040f1, fix(ci): fleet-snapshot workflow — skip private submodules, introduced this fix.)
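A pre-flight check along these lines would make that dependency explicit. It is a minimal sketch and not part of the documented workflow: the function name require_parent_paths is hypothetical, and only the three paths come from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the documented workflow):
# confirm the parent-repo paths the snapshot needs exist after checkout.
require_parent_paths() {
  local root="${1:-.}" missing=0 path
  for path in registry.yaml state cli; do
    if [[ ! -e "$root/$path" ]]; then
      echo "::error::missing $path in parent checkout"
      missing=1
    fi
  done
  # Returns 0 when everything is present, 1 when anything is missing.
  return "$missing"
}
```

In a workflow step this could be called as require_parent_paths "$GITHUB_WORKSPACE" before the cli build.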

Doctor sweep — non-fatal

- name: Doctor sweep — refresh state/ for every locally-clonable repo
  run: |
    node dist/index.js doctor --repo .. --state-dir ../state --commit-state || {
      echo "::warning::doctor exited non-zero; state file written but drift detected"
    }

Doctor returning non-zero means drift was detected, not that the sweep failed. The state file is written either way; the warning surfaces in the run summary so an operator can investigate later. Failing the job here would block the snapshot post even when state is otherwise valid.
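The "warn, don't fail" behaviour can be isolated in a few lines. This is a sketch only: run_nonfatal is a hypothetical wrapper, whereas the real step inlines the `|| { ... }` form shown above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, downgrade a non-zero exit to a
# GitHub Actions warning annotation, and always report success.
run_nonfatal() {
  local label="$1"
  shift
  "$@" || echo "::warning::$label exited non-zero; continuing"
  return 0
}

run_nonfatal "doctor" false   # prints the warning, still returns 0
run_nonfatal "doctor" true    # prints nothing
```

Because the wrapper always returns 0, later steps such as the dashboard build still run when the wrapped command reports drift.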

Snapshot summary line

After dashboard generation:

jq -r '"  repo_count=\(.totals.repo_count)
            open_milestones=\(.totals.open_milestones)
            decisions_30d=\(.totals.decisions_30d)
            warnings=\(.warnings | length)"' snapshot.json

The four-tuple (repos, open milestones, recent decisions, warnings) is the operator’s at-a-glance view in the workflow run logs. The CASA dashboard renders the same numbers per integration column.
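To see the four-tuple against concrete input, the filter can be run on a small sample. This is a sketch under assumptions: the field names follow the jq filter above, but the sample values, the string type of the warnings entries, and the helper name snapshot_summary are illustrative. It requires jq on the PATH.

```shell
#!/usr/bin/env bash
# Sample data only: field names follow the jq filter above; the values and
# the shape of each warnings entry are invented for illustration.
sample='{"totals":{"repo_count":9,"open_milestones":4,"decisions_30d":12},"warnings":["w1","w2"]}'

# Hypothetical helper: reads a snapshot on stdin, prints the four-tuple on one line.
snapshot_summary() {
  jq -r '"repo_count=\(.totals.repo_count) open_milestones=\(.totals.open_milestones) decisions_30d=\(.totals.decisions_30d) warnings=\(.warnings | length)"'
}

printf '%s' "$sample" | snapshot_summary
# prints: repo_count=9 open_milestones=4 decisions_30d=12 warnings=2
```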

Hubble ingest

# $response is the path of a file that receives the response body; it is set earlier in the step.
status=$(curl --silent --show-error --output "$response" \
              --write-out '%{http_code}' \
              --max-time 30 \
              --request POST "${HUBBLE_BASE_URL%/}/api/petrova/snapshot" \
              --header "authorization: Bearer ${HUBBLE_INGEST_TOKEN}" \
              --header 'content-type: application/json' \
              --data-binary "@${GITHUB_WORKSPACE}/snapshot.json")
if [[ "$status" -lt 200 || "$status" -ge 300 ]]; then
  echo "::error::hubble ingest returned HTTP $status"
  exit 1
fi

Two failure modes the handler distinguishes:

Empty token → fail before curl: “::error::HUBBLE_INGEST_TOKEN secret is not set”. Fast-fail on missing config.
Non-2xx response → fail after curl with the actual status code surfaced. The body of the response is cat’d to the run log for diagnosis.
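The two checks can be sketched as separate helpers. The function names require_token and check_status are hypothetical, and the workflow itself may inline this logic differently; only the error messages and the 2xx boundary come from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the two checks described above.

# Check 1: refuse to proceed (no network call) when the token is empty.
require_token() {
  if [[ -z "${1:-}" ]]; then
    echo "::error::HUBBLE_INGEST_TOKEN secret is not set"
    return 1
  fi
}

# Check 2: treat anything outside 200-299 as a failure and print the
# saved response body so it appears in the run log.
check_status() {
  local status="$1" body_file="$2"
  if [[ "$status" -lt 200 || "$status" -ge 300 ]]; then
    echo "::error::hubble ingest returned HTTP $status"
    if [[ -f "$body_file" ]]; then cat "$body_file"; fi
    return 1
  fi
}
```

A step would call require_token "$HUBBLE_INGEST_TOKEN" before curl and check_status "$status" "$response" after it, exiting on the first non-zero return.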

Required configuration

Name	Type	Purpose
`HUBBLE_INGEST_TOKEN`	secret	Bearer token, same value as hubble’s `EVENTS_INGEST_TOKEN`. Must be updated whenever that token is rotated.
`HUBBLE_BASE_URL`	var	Defaults to `https://hubble.devarno.cloud`; override for staging.
`GITHUB_TOKEN`	injected	Lifts the GH API rate limit 60 → 5000 req/h on the `--remote` walker.
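How the base URL resolves into the ingest URL can be sketched with shell parameter expansion. The helper name ingest_url and the host staging.example are hypothetical, and applying the default in shell is an assumption; the workflow may set it elsewhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the ingest URL from an optional base URL.
# Applying the default here is an assumption; the workflow may do it elsewhere.
ingest_url() {
  local base="${1:-https://hubble.devarno.cloud}"
  # ${base%/} drops one trailing slash so the path never starts with "//".
  echo "${base%/}/api/petrova/snapshot"
}

ingest_url                              # https://hubble.devarno.cloud/api/petrova/snapshot
ingest_url "https://staging.example/"   # https://staging.example/api/petrova/snapshot
```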

CASA correspondence

CASA’s “Petrova Fleet” zone reads from the same /api/petrova/snapshot endpoint that the workflow posts to. The dashboard’s nine-repo grid and integration matrix (ARES/TRACEO/CRUMB/ROCKY/EVA columns) are rendered directly from the JSON shape the cli emits, with no re-aggregation.

When the snapshot is stale

If CASA shows stale data, walk the path backward:

Check the latest Fleet snapshot workflow run on petrova-hq.
If the run failed at Post to hubble, check HUBBLE_INGEST_TOKEN rotation status against hubble’s current EVENTS_INGEST_TOKEN.
If the run succeeded but CASA is stale, check hubble’s ingest endpoint logs.
If the Doctor sweep warning fired, drift exists in state/; investigate the per-repo state file.
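The checklist above can be read as a small decision function. This is a sketch only: next_check and its arguments are hypothetical, and the fallback branch for other failures is an assumption rather than a documented procedure.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper encoding the checklist above.
# $1: run conclusion ("success" or "failure"); $2: name of the failed step, if any.
next_check() {
  local conclusion="$1" failed_step="${2:-}"
  if [[ "$conclusion" != "success" && "$failed_step" == "Post to hubble" ]]; then
    echo "compare HUBBLE_INGEST_TOKEN with hubble's EVENTS_INGEST_TOKEN"
  elif [[ "$conclusion" == "success" ]]; then
    echo "check hubble's ingest endpoint logs"
  else
    # Assumed fallback; the checklist does not cover other failing steps.
    echo "open the failed step in the workflow run log"
  fi
}
```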

Key Terms

--remote walker — the cli’s mode that walks registry.yaml repos via the GH API instead of local clones; the only mode usable in CI.
Doctor sweep — refreshes per-repo state files; non-zero exit is a warning, not a job failure.
CASA Petrova Fleet zone — the dashboard surface that reads hubble’s stored snapshots.

Q&A

Q: Why does the workflow check out without submodules? A: The submodules (core/prompts, core/templates) are private repos. The default GITHUB_TOKEN cannot fetch them, and the snapshot generation needs only registry.yaml, state/, and cli/ — all parent-tracked. Skipping them is faster and prevents spurious fetch failures. (Fix landed 2026-05-06.)

Q: What two failure modes does the curl response handler distinguish? A: (1) HUBBLE_INGEST_TOKEN unset → fast-fail before curl with an explicit error and no network call. (2) Non-2xx HTTP status from hubble → fail after curl, surfacing the actual status code and the response body in the run log. Other failures (timeout, DNS) propagate via curl’s exit code.

Q: What concurrency setting prevents two snapshots from racing? A: concurrency: { group: fleet-snapshot, cancel-in-progress: false }. Two runs landing simultaneously would race on the --commit-state step. cancel-in-progress: false is deliberate — queue rather than cancel, so manual workflow_dispatch runs don’t kill an in-flight cron run.

Examples

An operator notices CASA’s KAHN row hasn’t refreshed. In the petrova-hq Actions tab, the Fleet snapshot run for 2026-05-06 shows green: the doctor sweep emitted a warning (“KAHN: ARES probe degraded”), the dashboard step ran, and the artefact shows repo_count=9 open_milestones=4 decisions_30d=12 warnings=8. The hubble POST returned HTTP 200. The staleness is therefore downstream of hubble; investigate hubble’s render path, not petrova-hq.
