Fleet-Snapshot Workflow to CASA Dashboard
petrova intermediate 6 min read
ELI5
Once a day a robot reads every governed repo, builds a single JSON status report, and drops it on hubble’s doorstep. The CASA dashboard reads from that doorstep. If the doorstep is stale, the dashboard is stale.
Technical Deep Dive
The daily path
    sequenceDiagram
        autonumber
        participant Cron as GitHub Actions cron (07:00 UTC)
        participant CO as actions/checkout (no submodules)
        participant CLI as cli/ (npm ci + build)
        participant Doctor as node dist/index.js doctor --commit-state
        participant State as state/<slug>.yaml
        participant Dash as node dist/index.js dashboard --remote
        participant Art as upload-artifact (snapshot.json)
        participant Hubble as POST hubble/api/petrova/snapshot
        participant CASA as CASA Petrova Fleet zone

        Cron->>CO: trigger (or workflow_dispatch)
        CO->>CO: checkout parent only (submodules are private)
        CO->>CLI: cd cli, npm ci, npm run build
        CLI->>Doctor: doctor --repo .. --state-dir ../state --commit-state
        Doctor->>State: refresh state/<slug>.yaml entries
        Note over Doctor: non-zero exit logs ::warning:: but does not fail the job
        CLI->>Dash: dashboard --remote --format json > snapshot.json
        Note over Dash: GITHUB_TOKEN authenticates API calls (unauthenticated limit is 60 req/h)
        Dash->>Art: upload as fleet-snapshot artefact, retain 14 days
        Dash->>Hubble: curl -X POST -H "authorization: Bearer $HUBBLE_INGEST_TOKEN"
        alt 2xx
            Hubble-->>CASA: snapshot live, dashboard refreshes
        else non-2xx
            Hubble-->>Dash: ::error::hubble ingest returned HTTP $status (job fails)
        end

Trigger surface (.github/workflows/fleet-snapshot.yml)
    on:
      schedule:
        - cron: '0 7 * * *'   # 07:00 UTC daily
      workflow_dispatch:

    permissions:
      contents: read

    concurrency:
      group: fleet-snapshot
      cancel-in-progress: false

cancel-in-progress: false is deliberate. Two snapshot runs racing on the --commit-state step could corrupt the committed state, so a new run queues behind the in-flight one instead of cancelling it.
Why submodules are skipped at checkout
    - uses: actions/checkout@v4
      # Submodules (core/prompts, core/templates) are private and not
      # accessible via the workflow's default GITHUB_TOKEN.

The workflow needs registry.yaml, state/, and cli/, which all live in the parent repo. Pulling the submodules would fail because they are private, and that failure would stop the job before the snapshot ran, so checkout skips them. (Commit 5a040f1, "fix(ci): fleet-snapshot workflow — skip private submodules", introduced this fix.)
Doctor sweep — non-fatal
    - name: Doctor sweep — refresh state/ for every locally-clonable repo
      run: |
        node dist/index.js doctor --repo .. --state-dir ../state --commit-state || {
          echo "::warning::doctor exited non-zero; state file written but drift detected"
        }

A non-zero exit from doctor means drift was detected, not that the sweep failed. The state file is written either way; the warning surfaces in the run summary so an operator can investigate later. Failing the job here would block the snapshot post even when state is otherwise valid.
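The warn-but-continue pattern above can be factored into a small helper. This is a minimal sketch, not the workflow's actual code: run_nonfatal is a hypothetical name, and it assumes every non-zero exit should be downgraded to a ::warning:: annotation.

```shell
# Hypothetical helper: run a command and downgrade a non-zero exit to a warning.
# $1 is a label for the log line; the remaining arguments are the command to run.
run_nonfatal() {
  local label="$1"
  shift
  local rc=0
  "$@" || rc=$?
  if [ "$rc" -ne 0 ]; then
    # ::warning:: is the GitHub Actions workflow command for a warning annotation.
    echo "::warning::${label} exited ${rc}; continuing"
  fi
  # Always report success so the calling step keeps going.
  return 0
}
```

In a step, the call might read `run_nonfatal doctor node dist/index.js doctor --repo .. --state-dir ../state --commit-state`. The trade-off is the one the text describes: a genuine crash in doctor is also reduced to a warning, so the annotation needs to be read, not ignored.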
Snapshot summary line
After dashboard generation:
    jq -r '" repo_count=\(.totals.repo_count) open_milestones=\(.totals.open_milestones) decisions_30d=\(.totals.decisions_30d) warnings=\(.warnings | length)"' snapshot.json

The four-tuple (repos, open milestones, recent decisions, warnings) is the operator’s at-a-glance view in the workflow run logs. The CASA dashboard renders the same numbers per integration column.
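To see what the filter expects, it can be run against a tiny hand-written document. The sketch below assumes only the four paths the filter reads (.totals.repo_count, .totals.open_milestones, .totals.decisions_30d, .warnings); the sample values and the names sample_snapshot and summarize are illustrative, and the real snapshot presumably carries more fields. It requires jq on the PATH.

```shell
# Hypothetical sample: only the fields the summary filter reads, with made-up values.
sample_snapshot() {
  printf '%s\n' '{"totals":{"repo_count":9,"open_milestones":4,"decisions_30d":12},"warnings":["w1","w2"]}'
}

# Same filter as the workflow's summary line, reading JSON from stdin.
summarize() {
  jq -r '"repo_count=\(.totals.repo_count) open_milestones=\(.totals.open_milestones) decisions_30d=\(.totals.decisions_30d) warnings=\(.warnings | length)"'
}
```

Running `sample_snapshot | summarize` prints one line with the four counters. If a key is missing, jq interpolates null instead of failing, so a summary line containing null is a hint that the JSON shape changed.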
Hubble ingest
    status=$(curl --silent --show-error --output "$response" \
      --write-out '%{http_code}' \
      --max-time 30 \
      --request POST "${HUBBLE_BASE_URL%/}/api/petrova/snapshot" \
      --header "authorization: Bearer ${HUBBLE_INGEST_TOKEN}" \
      --header 'content-type: application/json' \
      --data-binary "@${GITHUB_WORKSPACE}/snapshot.json")
    if [[ "$status" -lt 200 || "$status" -ge 300 ]]; then
      echo "::error::hubble ingest returned HTTP $status"
      exit 1
    fi

Two failure modes the handler distinguishes:
- Empty token → fail before curl: “::error::HUBBLE_INGEST_TOKEN secret is not set”. Fast-fail on missing config.
- Non-2xx response → fail after curl with the actual status code surfaced. The response body is printed to the run log with cat for diagnosis.
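The two failure modes can be sketched as a pair of small functions. This is an illustration under assumptions, not the workflow's code: require_token and check_status are hypothetical names, and the sketch returns a status instead of calling exit so it can be exercised outside a workflow step.

```shell
# Hypothetical guard: fail fast on an empty token, before any network call.
require_token() {
  if [ -z "${1:-}" ]; then
    echo "::error::HUBBLE_INGEST_TOKEN secret is not set"
    return 1
  fi
}

# Hypothetical check: $1 is the HTTP status captured via --write-out '%{http_code}',
# $2 is the file curl wrote the response body to.
check_status() {
  local status="$1" body_file="$2"
  if [ "$status" -lt 200 ] || [ "$status" -ge 300 ]; then
    echo "::error::hubble ingest returned HTTP $status"
    # Surface the response body for diagnosis, if curl wrote one.
    if [ -f "$body_file" ]; then
      cat "$body_file"
    fi
    return 1
  fi
  return 0
}
```

A step would call `require_token "$HUBBLE_INGEST_TOKEN" || exit 1` before curl and `check_status "$status" "$response" || exit 1` after it. Keeping the two checks separate means a missing secret never produces a confusing 401 from hubble.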
Required configuration
| Name | Type | Purpose |
|---|---|---|
| HUBBLE_INGEST_TOKEN | secret | Bearer token, same value as hubble’s EVENTS_INGEST_TOKEN. Rotation-aware. |
| HUBBLE_BASE_URL | var | Defaults to https://hubble.devarno.cloud; override for staging. |
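| GITHUB_TOKEN | injected | Authenticates the --remote walker’s GH API calls, lifting the rate limit above the unauthenticated 60 req/h. GitHub documents 1,000 req/h per repository for the Actions-injected token; 5,000 req/h applies to personal access tokens. |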
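How the HUBBLE_BASE_URL default and the trailing-slash handling combine can be shown with a short sketch. ingest_url is a hypothetical helper; it assumes the default is applied in shell with `${1:-...}`, whereas the workflow may set it elsewhere (for example in an env block). The `${base%/}` expansion is the same one the curl call uses, and the staging hostname is a placeholder.

```shell
# Hypothetical helper: build the ingest endpoint from an optional base URL.
ingest_url() {
  # Fall back to the documented default when no base URL is given (or it is empty).
  local base="${1:-https://hubble.devarno.cloud}"
  # ${base%/} strips one trailing slash, so both spellings yield the same path.
  printf '%s/api/petrova/snapshot\n' "${base%/}"
}

# Usage: the first form uses the default, the second a placeholder staging host.
#   ingest_url
#   ingest_url "https://hubble-staging.example.com/"
```

Stripping the slash avoids a `//api/...` path when an operator pastes a base URL with a trailing slash into the repository variable.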
CASA correspondence
CASA’s “Petrova Fleet” zone reads from the same /api/petrova/snapshot endpoint hubble ingested into. The dashboard’s nine-repo grid + integration matrix (ARES/TRACEO/CRUMB/ROCKY/EVA columns) is rendered directly from the JSON shape the cli emits — no re-aggregation.
When the snapshot is stale
If CASA shows stale data, walk the path backward:
- Check the latest Fleet snapshot workflow run on petrova-hq.
- If the run failed at Post to hubble, check HUBBLE_INGEST_TOKEN rotation status against hubble’s current EVENTS_INGEST_TOKEN.
- If the run succeeded but CASA is stale, check hubble’s ingest endpoint logs.
- If the Doctor sweep warning fired, drift exists in state/; investigate the per-repo state file.
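The walk-back above can be written down as a small decision helper. This is only a sketch for illustration: next_checks is a hypothetical function, its three inputs (run conclusion, name of the failed step, whether the doctor warning fired) are assumed to be read off the workflow run by the operator, and anything it does not recognise falls back to the first step of the list.

```shell
# Hypothetical triage helper.
# $1: run conclusion ("success", "failure", or empty if unknown)
# $2: name of the failed step, if any
# $3: "yes" if the Doctor sweep warning fired
next_checks() {
  local conclusion="${1:-unknown}" failed_step="${2:-}" doctor_warned="${3:-no}"
  case "$conclusion:$failed_step" in
    "failure:Post to hubble")
      echo "compare HUBBLE_INGEST_TOKEN with hubble's current EVENTS_INGEST_TOKEN" ;;
    success:*)
      echo "check hubble's ingest endpoint logs" ;;
    *)
      echo "open the latest Fleet snapshot run on petrova-hq" ;;
  esac
  # The doctor warning is independent of the run outcome, so it is reported separately.
  if [ "$doctor_warned" = "yes" ]; then
    echo "inspect the per-repo state file under state/"
  fi
}
```

For example, `next_checks success "" yes` prints the hubble-logs check followed by the state-file check, matching a green run that still carried a drift warning.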
Key Terms
- --remote walker: the cli’s mode that walks registry.yaml repos via the GH API instead of local clones; the only mode usable in CI.
- Doctor sweep: refreshes per-repo state files; non-zero exit is a warning, not a job failure.
- CASA Petrova Fleet zone: the dashboard surface that reads hubble’s stored snapshots.
Q&A
Q: Why does the workflow check out without submodules?
A: The submodules (core/prompts, core/templates) are private repos. The default GITHUB_TOKEN cannot fetch them, and the snapshot generation needs only registry.yaml, state/, and cli/ — all parent-tracked. Skipping them is faster and prevents spurious fetch failures. (Fix landed 2026-05-06.)
Q: What two failure modes does the curl response handler distinguish?
A: (1) HUBBLE_INGEST_TOKEN unset → fast-fail before curl with an explicit error and no network call. (2) Non-2xx HTTP status from hubble → fail after curl, surfacing the actual status code and the response body in the run log. Other failures (timeout, DNS) propagate via curl’s exit code.
Q: What concurrency setting prevents two snapshots from racing?
A: concurrency: { group: fleet-snapshot, cancel-in-progress: false }. Two runs landing simultaneously would race on the --commit-state step. cancel-in-progress: false is deliberate — queue rather than cancel, so manual workflow_dispatch runs don’t kill an in-flight cron run.
Examples
An operator notices CASA’s KAHN row hasn’t refreshed. In the petrova-hq Actions tab, the Fleet snapshot run for 2026-05-06 is green: the doctor sweep emitted a warning (“KAHN: ARES probe degraded”), the dashboard step ran, and the artefact shows repo_count=9 open_milestones=4 decisions_30d=12 warnings=8. The hubble POST returned HTTP 200, so the staleness is downstream of hubble; investigate hubble’s render path, not petrova-hq.