Architecture¶
Chad is a small amount of glue across a stack of things that already work. The interesting design is in the boundaries: where Chad's process ends and OpenShell's gateway begins; where a sub-agent's filesystem ends and the GitHub Actions runner's begins; where the autonomy boundary sits relative to each external action.
The concentric rings (plus an outer supervisor tier)¶
A naked CLI agent has no blast-radius story. Chad runs inside three boundaries, each with a different threat model — plus a fourth supervisor tier that lives outside the pod and watches it from the host so token-expensive supervision doesn't run inside the agent layer.
graph TB
subgraph L0["L0 · Host (operator machine)"]
H1[nemoclaw CLI]
H2[Cloudflare tunnel]
H3[Credentials + SSH keys]
H4[chad-tunnel<br/>SSH local-forward<br/>:8901 → chad-shim]
end
subgraph L1["L1 · Host supervisors (launchd, 5-min cadence)"]
W1[chad-gateway-watchdog<br/>OOM-recover gateway<br/>with 4 GB heap]
W2[chad-shim-watchdog<br/>restart shim on death]
W3[chad-spawn-poll<br/>reconcile GHA spawns]
W4[chad-webui-ingest<br/>chats → gbrain]
W5[nvidia-liveness<br/>model dropdown sweep]
end
subgraph L2["L2 · Container (OpenShell pod)"]
C1[Gateway user · capsh drops]
C2[L7 proxy + OPA policy<br/>MITM TLS for inspect hosts]
C3[L1 → L2 SSH ingress<br/>only path in]
end
subgraph L3["L3 · Sandbox (sandbox user, uid 998)"]
direction TB
S0[openclaw gateway<br/>port 18789]
S1[Chad main agent<br/>+ 12 standing crons]
S2[chad-shim.py<br/>port 8901<br/>OpenAI bridge]
S3[Sub-agents<br/>local + gha]
S4[Memory stores<br/>gbrain · lancedb · wiki]
S5["bin/ shims<br/>(wrapper-bug fixes)"]
end
subgraph L4["L4 · Per-spawn breakout"]
G1[GitHub Actions runner<br/>fresh VM per workflow]
end
L0 --> L2
L1 -.supervises.-> L2
L1 -.supervises.-> L3
L2 --> L3
L3 -.spawns.-> L4
L3 -.event-stream.-> L1
classDef host fill:#fef3c7,stroke:#d97706,color:#000
classDef watch fill:#fde68a,stroke:#b45309,color:#000
classDef container fill:#ddd6fe,stroke:#7c3aed,color:#000
classDef sandbox fill:#c7d2fe,stroke:#4f46e5,color:#000
classDef gha fill:#fecaca,stroke:#dc2626,color:#000
class L0 host
class L1 watch
class L2 container
class L3 sandbox
class L4 gha
| Tier | What lives here | What enforces the boundary | Failure mode + recovery |
|---|---|---|---|
| L0 · Host | nemoclaw CLI, credentials, Cloudflare tunnel, SSH keys | OS user, file permissions, SSH keys | Host crash = everything stops; restart laptop |
| L1 · Host supervisors | 5-min launchd cadence — gateway watchdog, shim watchdog, spawn-poll, webui-ingest, nvidia-liveness | macOS launchd; each plist is independent | If a watchdog dies, launchd restarts it; no in-band supervision needed |
| L2 · Container | OpenShell gateway user, L7 proxy, OPA policy engine | Linux capabilities (capsh drops), policy hash check at boot | Container restart resets all in-process state but state on bind-mounts persists |
| L3 · Sandbox | openclaw gateway, Chad agent, chad-shim.py, sub-agents, memory stores, wrapper-bug shims | Per-binary L7 egress allowlists pinned by /proc/self/exe; per-operator API-key fail-closed |
Gateway OOM → L1 watchdog respawns with 4 GB heap; shim crash → L1 watchdog restarts within 5 min |
| L4 · GHA runner | One-off sub-agent spawns under substrate: gha |
Fresh VM per workflow run, no shared state with Chad | Lossy by design; results round-trip via chad-state branches |
Three of the rings (L0, L2, L3) are kernel/OS-enforced. The L7
allowlists at the sandbox layer are policy-enforced — same syscall
vocabulary, but the squid+OPA combination decides whether each egress
flow is in-policy. L1 is the supervisor tier — it runs no agent
turns and consumes zero inference tokens. State changes detected by
L1 (gateway restarts, shim crashes, spawn-poll reconciliations) flow
back to L3 via the agent-inbox.jsonl event stream so the next cron
agent turn can pick up overnight incidents.
L4 is a per-spawn substrate choice, not an always-on ring. See Substrates.
L3.5 loses L7 enforcement
The GHA breakout swaps kernel-isolation for policy-isolation.
A spawn under substrate: gha runs in a fresh GitHub Actions VM
— strong isolation from Chad's container — but the L7 proxy +
OPA stack does not exist on the runner side. Per-binary egress
allowlists that hold inside L3 do not hold on the runner.
Pick gha for self-contained jobs (codex writing a markdown
report, opencode hitting a public API). Pick local for jobs
that need L7 enforcement (anything writing back to chad-state,
anything touching premium credentials).
The spawn flow¶
A typical chad-spawn invocation walks through this:
sequenceDiagram
participant Caller
participant Spawn
participant Budget
participant Ledger
participant Worker
participant Memory
Caller->>Spawn: chad-spawn kind X task-file T
Spawn->>Budget: reserve N tokens
Budget-->>Spawn: ok or exit 77
Spawn->>Ledger: append status queued
Spawn->>Ledger: append status running
Spawn->>Worker: run binary with prompt
Worker-->>Spawn: result JSON
Spawn->>Ledger: append status done
Spawn->>Memory: chad-collect merges result
The conditional split between the two substrates:
local—Spawnruns the binary in-container under timeout, captures stdout, parses the last line as JSON.gha—Spawnpushes achad-spawn/<id>branch with the rendered prompt and dispatches anagent-job.ymlworkflow. With--async, it returns the task id immediately; thechad-spawn-pollcron reconciles when the runner commitsresult.jsonback. With--sync, it polls until the branch updates.
Self-messages (Spawn->>Spawn) like render prompt from manifest and push spawn branch happen between the steps above; they're elided here so the swimlanes stay readable.
Participants in plain English:
- Caller — Chad, a cron wrapper, or a chat turn.
- Spawn —
chad-spawn, the orchestrator entry point. - Budget —
chad-budget, daily token reservation. - Ledger —
queue/tasks.jsonl, append-only task record. - Sub — the sub-agent binary (kind-dependent).
- Memory — today's
workspace/memory/<date>.mdfile.
The seven sub-agent kinds¶
Each kind is a YAML manifest pinning a binary, a network policy preset, a default substrate, and a prompt template.
| Kind | Binary | Substrate | What it's for |
|---|---|---|---|
coder |
pi |
local | Write/refactor code with build + tests |
researcher |
claude |
local | Brain-first lookups; gh search if brain insufficient |
writer |
claude |
local | Draft mail/docs; never publishes |
reviewer |
claude |
local | Audit a PR diff; GET-only on GitHub |
fitness |
claude |
local | Strength + mobility from gbrain-ingested books |
codex |
codex (npm) |
gha | OpenAI Codex; NVIDIA fallback when no OPENAI_API_KEY |
opencode |
opencode (npm) |
gha | Multi-provider; honors OPENAI_BASE_URL |
Adding a new kind is four files (manifest, policy preset, optionally a runner install step, sync) — no recompile, no image rebuild. Details in Orchestrator.
The kinds above are chad-spawn — imperative one-shot sub-agents.
A second, complementary orchestrator, chad-Smithers, handles
durable multi-step workflows (experiments, model fusion, the
autonomy ladder) with SQLite-backed crash-resume and a web IDE at
runs.supachad.com. The two are bridged, not exclusive: a Smithers
workflow can offload a heavy step to the gha substrate via
chad-spawn-gha. See Runs IDE.
The four memory layers¶
graph LR
subgraph WS["Always-injected (per-session)"]
A[IDENTITY · SOUL · USER]
B[AGENTS · TOOLS · MEMORY]
end
subgraph LTM["Semantic LTM (autoCapture)"]
C[memory-lancedb<br/>NV-Embed-v1 · 4096-dim]
end
subgraph WIKI["Named-entity wiki (bridge)"]
D[memory-wiki vault<br/>Obsidian-style]
end
subgraph BRAIN["Cross-domain hybrid"]
E[gbrain<br/>PGLite vector + graph]
end
Session([Chad session]) --> WS
Session -. recall .-> LTM
Session -. recall .-> WIKI
Session -. CLI subprocess .-> BRAIN
What goes where:
- Identity / principles / autonomy policy → workspace files. Loaded into every main session. Operator-owned, not curated by Chad.
- "Remember this preference / decision / contact" → memory-lancedb. Captures fire automatically on multilingual triggers (remember/preferences/decisions/possessives).
- Per-system reference / per-correspondent rich page → memory-wiki vault. Look up by name. Backlinks form a graph.
- Cross-domain knowledge / book chunks / research → gbrain. Two
books fully ingested for the
fitnesskind today.
The full decision tree is in Memory stack.
The autonomy boundary¶
Every external action passes through the action gate:
flowchart LR
A[Action proposed<br/>e.g. send mail to X] --> G{chad-action-gate}
G -->|auto| S[Ship it]
G -->|draft| P[Park for review]
G -->|block| X[Refuse]
G -->|budget| Q[Defer]
G -->|killed| K[Halted by .auto-disabled]
S --> L[Audit log<br/>auto-action-log.jsonl]
P --> R[Today's memory file]
X --> R
The gate's policy is /sandbox/.openclaw-data/auto-actions.json. It's
intended to be readable — the autonomy roadmap is the diff between
"what's auto today" and "what's draft or block today." See
Autonomy.
Two execution substrates¶
Sub-agent spawns route to one of two backends:
| Substrate | Runs in | Isolation | Network policy | Async |
|---|---|---|---|---|
local |
Chad's container | Process tree shared with parent | L7 policy preset enforced | No (sync only) |
gha |
Fresh GitHub Actions runner | Per-spawn VM | Lost — runner has wide egress | Yes (--async) |
The local substrate is right when the sub-agent needs gbrain, the
sandbox source clone, or NVIDIA inference. The gha substrate is
right when the sub-agent is self-contained — codex writing a markdown
report from a prompt, opencode running against an external repo.
Durable Smithers workflows also offload heavy steps to gha through
the runSpawn() bridge — see Runs IDE.
Where the source lives¶
- NemoClaw (NVIDIA-owned blueprint) — tantodefi/NemoClaw
on the
chad-devbranch. The hardened sandbox image, the policy presets, the orchestrator scripts, the cron wrappers. - gbrain — tantodefi/gbrain. PGLite-backed knowledge brain.
- chad-state (private) —
tantodefi/chad-state. Backup of the workspace dir + the agent-job workflow for the GHA substrate. Not publicly readable; no link. - OpenClaw — openclaw.ai. The agent runtime Chad runs as.
- OpenShell — NVIDIA/OpenShell. The sandbox + L7 gateway.