CRUMB a card from devarno-cloud

Fleet-Snapshot Workflow to CASA Dashboard

petrova intermediate 6 min read

ELI5

Once a day a robot reads every governed repo, builds a single JSON status report, and drops it on hubble’s doorstep. The CASA dashboard reads from that doorstep. If the doorstep is stale, the dashboard is stale.

Technical Deep Dive

The daily path

sequenceDiagram
autonumber
participant Cron as GitHub Actions cron (07:00 UTC)
participant CO as actions/checkout (no submodules)
participant CLI as cli/ (npm ci + build)
participant Doctor as node dist/index.js doctor --commit-state
participant State as state/<slug>.yaml
participant Dash as node dist/index.js dashboard --remote
participant Art as upload-artifact (snapshot.json)
participant Hubble as POST hubble/api/petrova/snapshot
participant CASA as CASA Petrova Fleet zone
Cron->>CO: trigger (or workflow_dispatch)
CO->>CO: checkout parent only — submodules are private
CO->>CLI: cd cli, npm ci, npm run build
CLI->>Doctor: doctor --repo .. --state-dir ../state --commit-state
Doctor->>State: refresh state/<slug>.yaml entries
Note over Doctor: non-zero exit logs ::warning:: but does not fail the job
CLI->>Dash: dashboard --remote --format json > snapshot.json
Note over Dash: GITHUB_TOKEN raises rate limit 60→5000 req/h
Dash->>Art: upload as fleet-snapshot artefact, retain 14 days
Dash->>Hubble: curl -X POST -H "authorization: Bearer $HUBBLE_INGEST_TOKEN"
alt 2xx
Hubble-->>CASA: snapshot live, dashboard refreshes
else non-2xx
Hubble-->>Dash: ::error::hubble ingest returned HTTP $status (job fails)
end

Trigger surface (.github/workflows/fleet-snapshot.yml)

on:
  schedule:
    - cron: '0 7 * * *' # 07:00 UTC daily
  workflow_dispatch:
permissions:
  contents: read
concurrency:
  group: fleet-snapshot
  cancel-in-progress: false

The concurrency group serializes runs, because two snapshot runs racing would corrupt the state-commit window. cancel-in-progress: false is deliberate: a new run queues behind the in-flight one instead of cancelling it.

Why submodules are skipped at checkout

- uses: actions/checkout@v4
  # Submodules (core/prompts, core/templates) are private and not
  # accessible via the workflow's default GITHUB_TOKEN.

The workflow needs registry.yaml + state/ + cli/, which all live in the parent repo. Pulling submodules would fail (private), the job would fail, and the snapshot would never run. Skipping them is the principled choice. (Commit 5a040f1 — fix(ci): fleet-snapshot workflow — skip private submodules introduced this fix.)

Doctor sweep — non-fatal

- name: Doctor sweep — refresh state/ for every locally-clonable repo
  run: |
    node dist/index.js doctor --repo .. --state-dir ../state --commit-state || {
      echo "::warning::doctor exited non-zero; state file written but drift detected"
    }

Doctor returning non-zero means drift was detected, not that the sweep failed. The state file is written either way; the warning surfaces in the run summary so an operator can investigate later. Failing the job here would block the snapshot post even when state is otherwise valid.
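The `|| { … }` guard is what keeps the step green: GitHub Actions runs bash steps with `-e`, so an unguarded non-zero exit would stop the job. A minimal sketch of the pattern follows. `fake_doctor` and `run_sweep` are hypothetical names introduced for illustration, with `fake_doctor` standing in for `node dist/index.js doctor` and always reporting drift.

```shell
#!/usr/bin/env bash
# Sketch of the non-fatal guard. fake_doctor is a hypothetical stand-in
# for `node dist/index.js doctor ...`; here it always exits 1 (drift).
fake_doctor() {
  return 1
}

# Run the sweep; on a non-zero exit, emit a workflow warning annotation
# instead of letting the failure propagate to the step.
run_sweep() {
  fake_doctor || {
    echo "::warning::doctor exited non-zero; state file written but drift detected"
  }
}

run_sweep
echo "sweep exit code: $?"
```

Because the `echo` inside the braces succeeds, the whole `||` list succeeds and the step's exit code stays zero even though the sweep reported drift.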

Snapshot summary line

After dashboard generation:

jq -r '" repo_count=\(.totals.repo_count)
open_milestones=\(.totals.open_milestones)
decisions_30d=\(.totals.decisions_30d)
warnings=\(.warnings | length)"' snapshot.json

The four-tuple (repos, open milestones, recent decisions, warnings) is the operator’s at-a-glance view in the workflow run logs. The CASA dashboard renders the same numbers per integration column.

Hubble ingest

status=$(curl --silent --show-error --output "$response" \
  --write-out '%{http_code}' \
  --max-time 30 \
  --request POST "${HUBBLE_BASE_URL%/}/api/petrova/snapshot" \
  --header "authorization: Bearer ${HUBBLE_INGEST_TOKEN}" \
  --header 'content-type: application/json' \
  --data-binary "@${GITHUB_WORKSPACE}/snapshot.json")
if [[ "$status" -lt 200 || "$status" -ge 300 ]]; then
  echo "::error::hubble ingest returned HTTP $status"
  exit 1
fi

Two failure modes the handler distinguishes:

  • Empty token → fail before curl: “::error::HUBBLE_INGEST_TOKEN secret is not set”. Fast-fail on missing config.
  • Non-2xx response → fail after curl with the actual status code surfaced. The body of the response is cat’d to the run log for diagnosis.
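The two checks can be sketched as small functions, which makes the order (token before curl, status after curl) easy to see. This is an illustrative sketch, not the workflow's actual step: the function names `require_token` and `check_status` are hypothetical, and only the error strings are taken from the workflow.

```shell
#!/usr/bin/env bash
# Sketch of the two checks around the curl call. Function names are
# hypothetical; the ::error:: strings mirror the ones quoted above.

# Fast-fail: refuse to proceed when the token is empty or unset.
require_token() {
  local token="${1:-}"
  if [[ -z "$token" ]]; then
    echo "::error::HUBBLE_INGEST_TOKEN secret is not set"
    return 1
  fi
}

# Post-curl: accept only 2xx; surface the status code otherwise.
check_status() {
  local status="$1"
  if [[ "$status" -lt 200 || "$status" -ge 300 ]]; then
    echo "::error::hubble ingest returned HTTP $status"
    return 1
  fi
}

require_token "" || echo "would exit before curl"
check_status 204 && echo "204 accepted"
check_status 503 || echo "would exit after curl"
```

In a real step each failing check would be followed by `exit 1`; the sketch returns instead so both paths can be exercised in one shell.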

Required configuration

Name                | Type     | Purpose
HUBBLE_INGEST_TOKEN | secret   | Bearer token, same value as hubble’s EVENTS_INGEST_TOKEN. Rotation-aware.
HUBBLE_BASE_URL     | var      | Defaults to https://hubble.devarno.cloud; override for staging.
GITHUB_TOKEN        | injected | Lifts the GH API rate limit 60 → 5000 req/h on the --remote walker.
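How the base URL default and the `${HUBBLE_BASE_URL%/}` expansion combine can be sketched as follows. The helper name `ingest_url` and the staging host are hypothetical, and the sketch assumes the default is applied in the shell; the workflow may instead apply it in the YAML expression layer.

```shell
#!/usr/bin/env bash
# Sketch: resolve the ingest URL from an optional base URL.
# ingest_url is a hypothetical helper; the default host and the
# /api/petrova/snapshot path are the ones named in this card.
ingest_url() {
  # Fall back to the default when the argument is unset or empty.
  local base="${1:-https://hubble.devarno.cloud}"
  # ${base%/} drops one trailing slash so the join never yields "//api".
  echo "${base%/}/api/petrova/snapshot"
}

ingest_url                                  # default host
ingest_url "https://staging.example.test/"  # hypothetical staging override
```

Stripping the trailing slash means an override works whether or not it was entered with a final `/`.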

CASA correspondence

CASA’s “Petrova Fleet” zone reads from the same /api/petrova/snapshot endpoint on hubble that the workflow posts to. The dashboard’s nine-repo grid and integration matrix (ARES/TRACEO/CRUMB/ROCKY/EVA columns) are rendered directly from the JSON shape the cli emits, with no re-aggregation.

When the snapshot is stale

If CASA shows stale data, walk the path backward:

  1. Check the latest Fleet snapshot workflow run on petrova-hq.
  2. If the run failed at Post to hubble, check HUBBLE_INGEST_TOKEN rotation status against hubble’s current EVENTS_INGEST_TOKEN.
  3. If the run succeeded but CASA is stale, check hubble’s ingest endpoint logs.
  4. If the Doctor sweep warning fired, drift exists in state/; investigate the per-repo state file.
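The backward walk can be sketched as a small lookup from the latest run's outcome to the next thing to check. Everything here is illustrative: `next_check`, its argument values, and the generic "read the failed step's log" branch are assumptions, and the other advice strings paraphrase the steps above.

```shell
#!/usr/bin/env bash
# Sketch of the backward walk as a lookup. next_check is hypothetical.
# $1: conclusion of the latest run ("success" or "failure")
# $2: name of the failed step, if any
next_check() {
  local conclusion="$1" failed_step="${2:-}"
  if [[ "$conclusion" == "failure" && "$failed_step" == "Post to hubble" ]]; then
    echo "compare HUBBLE_INGEST_TOKEN with hubble's EVENTS_INGEST_TOKEN"
  elif [[ "$conclusion" == "failure" ]]; then
    # Assumption: any other failed step is diagnosed from its own log.
    echo "read the failed step's log in the workflow run"
  elif [[ "$conclusion" == "success" ]]; then
    echo "check hubble's ingest endpoint logs"
  else
    # Outcome unknown: start at step 1.
    echo "check the latest Fleet snapshot run on petrova-hq"
  fi
}

next_check failure "Post to hubble"
next_check success
```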

Key Terms

  • --remote walker — the cli’s mode that walks registry.yaml repos via the GH API instead of local clones; the only mode usable in CI.
  • Doctor sweep — refreshes per-repo state files; non-zero exit is a warning, not a job failure.
  • CASA Petrova Fleet zone — the dashboard surface that reads hubble’s stored snapshots.

Q&A

Q: Why does the workflow check out without submodules?
A: The submodules (core/prompts, core/templates) are private repos. The default GITHUB_TOKEN cannot fetch them, and the snapshot generation needs only registry.yaml, state/, and cli/, all parent-tracked. Skipping them is faster and prevents spurious fetch failures. (Fix landed 2026-05-06.)

Q: What two failure modes does the curl response handler distinguish?
A: (1) HUBBLE_INGEST_TOKEN unset → fast-fail before curl with an explicit error and no network call. (2) Non-2xx HTTP status from hubble → fail after curl, surfacing the actual status code and the response body in the run log. Other failures (timeout, DNS) propagate via curl’s exit code.

Q: What concurrency setting prevents two snapshots from racing?
A: concurrency: { group: fleet-snapshot, cancel-in-progress: false }. Two runs landing simultaneously would race on the --commit-state step. cancel-in-progress: false is deliberate: queue rather than cancel, so manual workflow_dispatch runs don’t kill an in-flight cron run.

Examples

An operator notices CASA’s KAHN row hasn’t refreshed. In the petrova-hq Actions tab, the Fleet snapshot run for 2026-05-06 shows green. The doctor sweep emitted a warning (“KAHN: ARES probe degraded”), the dashboard step ran, and the artefact shows repo_count=9 open_milestones=4 decisions_30d=12 warnings=8. The hubble POST returned HTTP 200. The staleness is therefore downstream of the ingest: investigate hubble’s render path, not petrova-hq.
