Substrates: local vs GHA¶
Sub-agent spawns route to one of two execution backends. The choice
is set per-kind in the manifest's `substrate` field, with a per-spawn
override via `chad-spawn --substrate <local|gha>`.
At a glance¶
| Property | local | gha |
|---|---|---|
| Runs in | Chad's container | Fresh GitHub Actions runner |
| Isolation | Process-tree shared with parent | Per-spawn VM, no shared state |
| L7 network policy | Enforced (kind preset) | Lost — runner has wide egress |
| Inference reach | Local Nemotron + NVIDIA gateway | Public APIs only |
| gbrain access | Direct (CLI subprocess) | None (would need export/sync) |
| Async | No (sync only) | Yes (`--async`) |
| Cost | Operator's hardware | GHA minutes (free tier or paid) |
| Audit trail | `subagents/<id>/`, `queue/tasks.jsonl` | Branch + workflow run history |
The two are complementary. Pick based on what the sub-agent needs:
- Use `local` when the sub-agent depends on Chad's local state — brain queries, the `/sandbox/source` clone, embedded Nemotron inference, or anything that needs to be inside the L7 policy boundary.
- Use `gha` when the work is self-contained — a codex run that just needs OpenAI (or NVIDIA fallback) and a prompt, or an opencode call against a public repo. The runner's wide egress is fine because the sub-agent doesn't carry sensitive state.
How gha substrate works¶
Inspired by the popebot agent-job pattern: every spawn becomes a
branch in tantodefi/chad-state plus a workflow trigger.
```mermaid
flowchart TD
    A[chad-spawn] -->|push branch + manifest| B[chad-state repo]
    A -->|workflow_dispatch| C[agent-job.yml]
    C --> D[GHA runner]
    D -->|checkout, install binary, run prompt| D
    D -->|commit result.json| B
    A -. sync mode .-> E[poll branch every 10s]
    A -. async mode .-> F[return task_id now]
    F --> G[chad-spawn-poll cron]
    G -->|every 5 min| B
    G -->|copy back, transition ledger| H[subagents directory]
    H --> I[chad-collect --today]
```
The branch carries the task id in its name (e.g. `chad-spawn/abc123`).
Inside the branch, the spawn artifacts live at `spawns/<task-id>/`:

```
spawns/<task-id>/
├── prompt.txt      # rendered prompt
├── task.{json|txt} # original input
├── kind.yaml       # snapshot of the kind manifest
├── manifest.json   # synthesized for the runner to read
├── stdout.log      # ← runner writes
├── stderr.log      # ← runner writes
└── result.json     # ← runner writes (the structured result)
```
The branch is the job record. There's no jobs table, no queue
service. `git log chad-spawn/<task_id>` is the audit trail. Even if
Chad's local subagents directory is wiped, the result is durable on
chad-state.
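The naming convention described above (branch `chad-spawn/<task-id>`, artifacts under `spawns/<task-id>/`) can be captured in a small helper. This is a sketch of the convention as the docs state it; the helper itself is hypothetical:

```python
def spawn_locations(task_id: str) -> dict[str, str]:
    """Map a task id to its branch name and artifact paths on chad-state."""
    prefix = f"spawns/{task_id}"
    return {
        "branch": f"chad-spawn/{task_id}",
        "prompt": f"{prefix}/prompt.txt",
        "manifest": f"{prefix}/manifest.json",
        "stdout": f"{prefix}/stdout.log",   # written by the runner
        "stderr": f"{prefix}/stderr.log",   # written by the runner
        "result": f"{prefix}/result.json",  # the structured result
    }

print(spawn_locations("abc123")["branch"])  # chad-spawn/abc123
```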
Provider routing in the runner¶
The `agent-job.yml` workflow picks the inference provider at run time
based on which secret is set. This means a single
`NVIDIA_API_KEY` is enough to run codex/opencode kinds — no OpenAI
account required.
| Binary | Primary | Fallback | Notes |
|---|---|---|---|
| codex | `OPENAI_API_KEY` | `NVIDIA_API_KEY` via OpenAI-compat | Runner sets `OPENAI_BASE_URL=https://integrate.api.nvidia.com/v1` and `OPENAI_DEFAULT_MODEL=nvidia/nemotron-3-super-120b-a12b` |
| opencode | `OPENAI_API_KEY` | `NVIDIA_API_KEY` (same path) | Same env override pattern |
| claude | `ANTHROPIC_API_KEY` | none | Anthropic-only; `::error::` if missing |
| (unknown) | best-effort | NVIDIA preferred | Generic OpenAI-compat path tried |
The resolved provider is recorded in `result.json` so `chad-collect`
knows which backend produced each output.
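The routing table can be condensed into one decision function. The env var names and values come from the table above; the function shape, name, and error strings are an illustrative reconstruction of what the workflow does, not its actual code:

```python
def pick_provider(binary: str, secrets: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Return (provider, extra env) for a runner, mirroring the routing table."""
    nvidia_env = {
        "OPENAI_BASE_URL": "https://integrate.api.nvidia.com/v1",
        "OPENAI_DEFAULT_MODEL": "nvidia/nemotron-3-super-120b-a12b",
        "OPENAI_API_KEY": secrets.get("NVIDIA_API_KEY", ""),
    }
    if binary in ("codex", "opencode"):
        if "OPENAI_API_KEY" in secrets:
            return "openai", {}
        if "NVIDIA_API_KEY" in secrets:
            return "nvidia", nvidia_env  # OpenAI-compat fallback
        raise RuntimeError(f"::error:: no usable key for {binary}")
    if binary == "claude":
        if "ANTHROPIC_API_KEY" in secrets:
            return "anthropic", {}
        raise RuntimeError("::error:: ANTHROPIC_API_KEY missing")
    # Unknown binary: best effort, NVIDIA preferred via the same compat path.
    if "NVIDIA_API_KEY" in secrets:
        return "nvidia", nvidia_env
    return "openai", {}
```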
!!! note "codex's responses API"
    Modern codex uses OpenAI's responses API endpoint. NVIDIA's
    OpenAI-compat endpoint may not fully implement that. If you hit
    a binary-side error with the NVIDIA fallback, the fix is to set
    `OPENAI_API_KEY` — no other change needed.
Async mode + the reconciler¶
`chad-spawn --async --substrate gha` returns the task_id
immediately and writes a `running` ledger entry. The
`chad-spawn-poll` cron (every 5 min) does the reconciliation:
- Scans the ledger for entries with `status: running` and `substrate: gha`.
- For each, fetches the corresponding spawn branch.
- If `result.json` exists: copies back to local `subagents/<id>/`, transitions the ledger entry to `done|failed`.
- If timed out (older than `kind_timeout + 600s` by default): synthesizes a `failed` result with `exit_code: 124` and a "did not commit within Ns" summary.
- Calls `chad-collect --today` if anything reconciled, so results flow into today's memory automatically.
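The per-entry decision the reconciler makes can be sketched as a pure function. The 600 s grace period and the `exit_code: 124` convention are from the doc; the function itself and its action names are illustrative assumptions:

```python
def reconcile(entry: dict, has_result: bool, age_s: float) -> str:
    """Decide what chad-spawn-poll does with one ledger entry.

    entry needs "status", "substrate", and "kind_timeout" (seconds).
    """
    if entry.get("status") != "running" or entry.get("substrate") != "gha":
        return "skip"
    if has_result:
        return "copy_back"  # copy to subagents/<id>/, mark done|failed
    if age_s > entry["kind_timeout"] + 600:
        # Synthesize a failed result: exit_code 124, "did not commit within Ns".
        return "synthesize_failure"
    return "wait"  # still within the timeout window; check again next cycle
```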
Branch retention¶
`chad-spawn-gc` (weekly, Mon 02:30 UTC) deletes terminal-state spawn
branches per the retention policy:
- `done` branches: 7 days (audit trail for recent successes)
- `failed` branches: 30 days (longer window for postmortems)
- `queued` / `running` branches: always kept (still in flight)
- unknown (branch with no matching ledger entry): kept and flagged for operator review (state drift)
Without GC, the 24-spawn/day budget would accrue ~700 branches/month on chad-state.
Setting it up¶
The gha substrate needs one-time bootstrap on the operator host:
```shell
cd ~/.nemoclaw/source
./scripts/chad-state-bootstrap          # PR-based install (recommended)
# OR
./scripts/chad-state-bootstrap --no-pr  # commit straight to main
```
This clones chad-state, syncs `scripts/chad-state-templates/*` (the
`agent-job.yml` workflow), and opens a PR. After merge, set the
provider keys:
```shell
gh secret set NVIDIA_API_KEY --repo tantodefi/chad-state --body "$KEY"
gh secret set OPENAI_API_KEY --repo tantodefi/chad-state --body "$KEY"     # optional
gh secret set ANTHROPIC_API_KEY --repo tantodefi/chad-state --body "$KEY"  # optional
```
Then verify that Chad's sandbox-side `gh` has push access to chad-state, and smoke-test:
```shell
ssh openshell-chad 'echo "Write a haiku about a sandbox" > /tmp/t.md && \
  chad-spawn --kind codex --task-file /tmp/t.md'
```
codex defaults to `substrate: gha`, so no `--substrate` flag is needed.
A third substrate, eventually¶
A per-spawn k3s pod would give the isolation of GHA runners with the L7 policy enforcement of the local substrate — useful for sub-agents that want both. Not built; tracked on the Roadmap. The full design lives in the upstream design doc.