Changelog¶
A curated record of what shipped on chad-dev. Hand-edited from
git log; not exhaustive. Conventional Commits scope (feat(chad/*))
maps to the headings below.
2026-06-17¶
Chad Lite — a stateless tier for new users + SearXNG web search¶
A lightweight rung for onboarding more Open WebUI users without sharing
the operators' agent or memory. Chad Lite (chad-lite model) is
plain Nemotron Ultra + a generic prompt with no OpenClaw agent,
gbrain, or pod state — memory and web search come from Open WebUI
itself (per-user Memory + a private in-stack SearXNG, json +
limiter:false, keyless, never host/internet-exposed). Isolated by
account, zero new pod load, scales to the Cloudflare 50-seat cap.
Ships with chad-provision-tier0.sh, a chad-webui models
--access-control passthrough, and the multi-user design doc (tiers,
the limits of many OpenClaw agents in one pod, and the auth flow:
Cloudflare's list is the whitelist; Open WebUI auto-provisions on first
login at DEFAULT_USER_ROLE=user). See Front-ends.
2026-06-16¶
Runs IDE at runs.supachad.com¶
A web IDE for durable Smithers workflows, served by a Hono server
(scripts/chad-smithers/serve-runs.js) that reads the Smithers SQLite
DBs directly — full run history across every workflow DB,
auto-discovering new runs, no Smithers daemon to babysit. Four tabs:
Runs (live-polling list + run detail with task tree, outputs,
logs, chat, diffs, trace), Workflows (interactive mermaid DAG from
smithers graph), Editor (CodeMirror read/edit/save of the
.jsx, path-validated), and Experiments (the evolutionary
leaderboard). From the browser the operator can cancel, approve/deny
an <Approval> gate, fork for replay, diff a node, and launch a run.
Fail-closed auth: x-smithers-key (programmatic) or
Cf-Access-Authenticated-User-Email (SSO); writes pass an extra
operator gate. Chad drives the same REST API headlessly via the
chad-runs ESM CLI (health / runs / get / logs / trace / launch /
cancel / fork / approve / experiments / cat / save). Deploy is a
launchd agent on 0.0.0.0:7331 behind the existing cloudflared; the
one remaining step is a Cloudflare dashboard public-hostname route +
Access app for the subdomain.
Model fusion + daily NVIDIA model refresh¶
workflows/fusion.jsx runs one prompt across N models in parallel
(each its own durable task), then a synthesizer fuses them into one
best answer with a best_model pick — the runs IDE gets access to
every model Open WebUI does. The roster stays current via
refresh-models.js (launchd chad-models-refresh, daily 04:45):
fetches NVIDIA's /v1/models, filters to chat/agentic, and writes a
priority-ordered featured list to state/models.json. New NVIDIA
open models join the fusion roster the next morning with no manual
edit. (GLM 5.1 is on the NVIDIA API; 5.2 is not yet — auto-adopted
when it lands.)
Two orchestrators, bridged — and the first ports¶
Settled the chad-spawn vs chad-Smithers question: keep both, bridge
them. chad-spawn stays the GHA-isolated one-shot sub-agent substrate;
chad-Smithers is the durable workflow orchestrator that can call
chad-spawn to offload one heavy step onto a GH runner (reuses the
existing machinery, no parallel GHA system). Then built it:
- The bridge —
scripts/chad-smithers/lib/spawn.js.runSpawn()runs inside a durable Smithers task and shells out tochad-spawn; never throws (always resolves to aresult.jsonshape). Transports: host stub (CHAD_SPAWN_STUB=1), SSH-to-pod (CHAD_SPAWN_SSH, task streamed in / result back since scp is blocked), or local. Portschad-route→route()and issue scoring →scoreIssue()as unit-tested plain JS (8/8). - Migrated multi-spawn flows —
issue-triage.jsx(fetch → score+route →Parallelspawn per issue → report; verified end-to-end) andcontent-pipeline.jsx(research→draft→review spawns → reviewer-gatedApproval→ publish). - Ported cron features —
self-improve.jsx,memory-curator.jsx, andlog-digest.jsx(verified on the host's 29 service logs). All shadow-safe by default.
Mechanical one-shot crons (backups, prune, gc, budget-audit) stay plain wrappers — Smithers is overhead there. See Runs IDE, Orchestrator, Substrates.
2026-06-13¶
Nemotron 3 Ultra 550B + reasoning on by default¶
Adopted NVIDIA's newest, most capable open model
(nvidia/nemotron-3-ultra-550b-a55b) as the default for the hosted
inference profiles. The long-standing harness bug that blocked
reasoning with tool calls (the Kimi K2.5 reasoningSafe=false class)
was retested against Ultra and is absent — Ultra emits standard
tool calls and round-trips cleanly with reasoning on, so reasoning is
now on by default. Self-hosted profiles stay on Super 120B (550B local
is a much larger GPU footprint).
Smithers evolutionary experiment runner (shadow)¶
A durable, host-side experiment runner on
Smithers under scripts/chad-smithers/: a model
router (auto-routes tiers across Nemotron Ultra / Claude subscription /
local), an evolutionary selection engine (start wide → score in
parallel → keep only what works), crash-resumable runs, and a nightly
launchd job that posts the leaderboard to OpenWebUI. Runs in shadow
beside the deterministic chad-experiment-cron. Plus MCP-health-probe
and fail-only cron-report workflows.
Moshi notifications + phone approval¶
Wired the Moshi agent hooks so host claude runs notify the operator's
phone on finish and route approval gates to the phone — the mechanism
behind the autonomy ladder. Added chad-tmux for phone-attachable
host sessions.
2026-05-14¶
Autonomous experiment lifecycle¶
A sixth autonomy loop: chad-experiment-night cron at 02:00 UTC runs
four phases nightly — propose hypotheses from memory, observe running
experiments, evaluate against success metrics, auto-schedule calendar
coordination. Bounded by a per-operator concurrent budget (default 3)
and a regression auto-retire threshold (default -0.30). Materializes
artifacts (automations, functions, tools, knowledge collections,
memories, notes, calendar events) via the chad-webui MCP tools.
Components: chad-experiment CLI (13 verbs), chad-experiment skill
documenting methodology, state/experiments/{config,ledger,active,archive}
ledger format, calendar tag conventions ([chad-block],
[chad-experiment], [operator-sync], [experiment-review]).
Full OpenWebUI v0.9.5 API coverage + native MCP tools¶
chad-webui CLI extended from 33 to 60 sub-commands across 10 groups,
covering every documented v0.9.5 REST endpoint. Companion
chad-webui-mcp stdio server registers each sub-command as a native
MCP tool (webui__<group>_<command>). Fail-closed per-operator
API-key selection (OPENWEBUI_API_KEY_<SLUG>) enforced at the wrapper
layer. New openwebui skill (927 lines) documents every command with
worked examples + composed recipes + nightly experiment playbook.
Host-side watchdogs¶
spawn-poll moved out of openclaw cron (was 288 agent turns/day at
~44k tokens each → ~12.7M tokens/day) to a host launchd job. Same
script, no agent overhead. State-change events flow back to Chad via
a new state/agent-inbox.jsonl stream that cron agent turns can tail
on startup. Also: new chad-shim-watchdog launchd job catches
BrokenPipeError-class shim crashes in 5-min worst case (was 6h).
Wrapper-bug shim layer¶
Five argv-vs-env / hardcoded-path bugs in /usr/local/bin/chad-*
(chad-budget, chad-issue-triage ×3 sites, chad-mail-check,
chad-issue-triage-cron, chad-email-check-cron) caused
silent skips and traceback crashes across the issue-triage,
email-check, and self-improve cron paths. All shimmed via
/sandbox/.openclaw-data/bin/ with cron-payload absolute-path
overrides; documented at docs/operations/wrapper-bugs.md in the
NemoClaw repo.
2026-05-13¶
Memory pipeline consolidation + gateway watchdog¶
Three jobs now converge OpenWebUI chats, workspace events, and system
docs into a single bounded gbrain. chad-gbrain-prune (sandbox cron,
Sundays 02:00 UTC) is the new weekly retention sweep: deletes
memory/<date> and events/<date> pages older than 365 days,
chat/<id> pages older than 180 days, and workspace
dream-digest-*.md/feedback_*.md/prune-log-*.md files older
than 30 days. Default DRY_RUN=1; protects system/*, agent/*,
and the workspace <date>.md journal (source-of-truth).
chad-webui-ingest (host launchd, 04:30 UTC daily) reads chats
updated in the last 25 h from ~/.nemoclaw/openwebui/data/webui.db
(turns live in the chat.chat JSON column — not the message
table, which is for channels), renders each as markdown with
frontmatter, and SSHes the content into the sandbox via
gbrain put chat/<chat-id>. After ingest, the next
chad-gbrain-dream pass picks chats up alongside the workspace
journal. The dream wrapper now surfaces the Embedded/Chunks
ratio in dream-digest-<date>.md and writes
feedback_embed_staleness_<date>.md if coverage drops below 95%
(catches a silently-stalled detached gbrain embed --stale).
chad-gateway-watchdog (host launchd, every 5 min) is the new
supervisor for the OpenClaw gateway. The chad pod has no PID-1
supervisor for the gateway process — when it crashed on
2026-05-12 with a V8 OOM at the default 1.7 GB heap, nothing
restarted it for 9 hours, silently failing every cron and
chad-shim call. The watchdog pings the WS port via SSH, kills any
stale process if needed, and relaunches with
NODE_OPTIONS=--max-old-space-size=4096 to push the OOM
threshold further out.
Critical doc fix. Discovered during the same audit: the chad
pod has no PVC backing /sandbox — verified via
kubectl get pod chad -n openshell -o jsonpath='{.spec.volumes[*].name}'.
The only mounts are openshell-client-tls,
openshell-supervisor-bin, and the kube-api-access SA volume.
Running kubectl delete pod chad wipes ~1.2 GB of state. chad-readme.md
§ 8 and docs/workspace/workspace-files.md were updated to remove
the old "PVC outlives container restarts" claim and add an
SSH-tar backup recipe to take before any pod-recreating operation.
The proper fix is a PVC for /sandbox in the nemoclaw-blueprint
StatefulSet.
2026-05-12¶
OpenWebUI auto-curated dropdown + cloudflared HTTP/2 fix¶
The model picker in OpenWebUI is now auto-curated. A host launchd
job (nvidia-liveness, daily 04:00 UTC) probes every model in
NVIDIA Build's /v1/models with a one-token chat completion,
classifies each as live / dead / unknown / transient
(HTTP 410 Gone marks dead immediately; 4xx/5xx need three
consecutive strikes; network timeouts preserve previous status),
ranks them per-provider by parsed version + params, and writes
featured + per-model status to ~/.nemoclaw/openwebui/liveness.json.
A separate sibling proxy (nvidia-proxy, host launchd, KeepAlive)
sits in front of integrate.api.nvidia.com — passes through every
request unchanged except GET /v1/models, which it filters to the
featured set. OpenWebUI's OPENAI_API_BASE_URL points at the
proxy so the dropdown stays current with NVIDIA's catalog and
hides EOL'd models the day they're flagged.
Curation is one TOML file (scripts/openwebui/nvidia-curation.toml)
with per_provider_limit and a glob exclude list. Default is
flagship-only — ~14 models. Curated model rows in webui.db whose
base_model_id is dead get auto-soft-disabled (is_active=0)
by the liveness sweep, so the dead-llama-3.1-405b entries that had
been producing user-visible 410 errors stopped appearing.
Separately: cloudflared on the tunnel side was reconnecting every
10–30 minutes with QUIC stream timeouts ("no recent network
activity"). Residential routers / NAT firewalls / mobile carriers
prune idle UDP flows aggressively. Forcing --protocol http2
moves the transport to TCP and removes the entire failure class
at ~5–15 ms of extra latency per request.
Also pinned WEBUI_SECRET_KEY from .env so container recreates
don't regenerate the file inside the image layer, which had been
invalidating every existing browser JWT and bouncing users back
to the sign-in page on every redeploy.
2026-05-11¶
Sandbox survival hardening¶
The Bonjour mDNS plugin was crash-looping the gateway on the
loopback-only sandbox. Disabled it in openclaw.json's
agents.defaults. Raised agents.defaults.llm.idleTimeoutSeconds
from 60 s to 180 s so the slow cold-start path on
nvidia/nemotron-3-super-120b-a12b doesn't trip the idle gate
mid-tool-call. Consolidated the split-brain
feedback-proposals.md path so the chad-self-improve wrapper
and the curator agree on a single canonical location. Rewrote
the proton-calendar SKILL.md with a cron-boundary decision
table and a link to EMAIL-POLICY so subagents don't reach for the
calendar when an email reply would do.
A new chad-readme.md § 11.5 documents the deploy gap between
source/scripts/chad-cron-wrappers/ and /usr/local/bin/ on the
running image, plus the kubectl cp recovery procedure when a
wrapper needs to be pushed without rebuilding the image.
2026-05-07¶
Cron context optimization + reliability cleanup¶
Applied --light-context to every registered cron via
openclaw cron edit. Observed input-token reductions of 70–82% on
wrapper-only fires (e.g. chad-skill-watch 177k → 48k, duration
167 s → 55 s). Added
agents.defaults.llm.idleTimeoutSeconds = 60 to openclaw.json so
hung Nemotron sessions fail fast instead of sitting until the
600 s job timeout. Re-registered the three crons that were
documented but missing from the live gateway
(memory-curator, spawn-poll, spawn-gc); the chad-readme
listed nine but chad-setup.sh had since grown to register
eleven. Sticky Error statuses on chad-proposal-apply,
chad-skill-watch, self-improve, and chad-budget-audit cleared
once the underlying delivery Channel is required failure was
resolved by re-applying --no-deliver --best-effort-deliver to
every job.
2026-05-06¶
Async sub-agent spawns¶
chad-spawn --async --substrate gha now returns a task_id immediately
and writes a running ledger entry. The new chad-spawn-poll cron
(every 5 min) reconciles when the GHA runner commits result.json back
to the spawn branch, transitioning the ledger and triggering
chad-collect. Async mode deletion via chad-spawn-gc (weekly Mon)
prevents chad-spawn/* branches from accruing on chad-state. Added
opencode kind + preset (multi-provider coding CLI) and a new
--binary-override flag for per-spawn binary swap (e.g., spawning
writer with the codex binary instead of claude while keeping
the writer's prompt template and timeout). The kind's L7 policy
preset still applies — the override binary must be in that preset's
allowlist. Ledger entries now carry substrate and async fields.
NVIDIA fallback in the GHA agent-job runner¶
The runner now picks an inference provider at workflow time. With only
NVIDIA_API_KEY set, codex / opencode kinds route through
integrate.api.nvidia.com/v1 (OpenAI-compatible) targeting Nemotron;
adding OPENAI_API_KEY automatically upgrades to real OpenAI.
claude requires ANTHROPIC_API_KEY (no NVIDIA fallback —
Anthropic-only). The resolved provider lands in result.json so
chad-collect knows which backend produced each output.
GHA substrate for chad-spawn (popebot pattern)¶
Sub-agent spawns can now route to a GitHub Actions runner per job.
New --substrate <local|gha> flag on chad-spawn, kind manifests gain
a substrate field, chad-spawn-gha.sh helper handles
push + dispatch + sync poll. Workflow template + chad-state-bootstrap
installer ship in the source repo. The branch on chad-state is the
job record — no jobs table, no queue service.
Codex kind stub + design doc¶
First non-built-in kind, validating the "no recompile, no image
rebuild" extensibility claim. Architecture for spawn substrates
captured in docs/design/spawn-as-github-run.md.
Hermes-style memory curator + pre-mutation snapshots¶
chad-memory-curator runs weekly Sat 04:00 UTC. Inactivity-gated,
budget-guarded, draft-only. Snapshots lancedb + wiki + workspace
before any work via chad-memory-snapshot (tar.gz, keep last 5,
reversible rollback). Memory-lancedb autoCapture self-healed to
true by chad-setup.sh Step 3e — the unlock for Chad's previously
empty LTM. Step 3f actively removes gbrain from mcp.servers if
re-introduced (PGLite single-process file lock means MCP serve
blocks every cron wrapper that uses gbrain CLI).
2026-05-05¶
Memory plugin stack wired¶
memory-lancedb (NV-Embed-v1 4096-dim) + memory-wiki (bridge mode) + active-memory + tokenjuice configured and loaded. Two-layer architecture: semantic LTM via lancedb, named-entity wiki for retrieval-by-name.
Self-improvement loops closed¶
chad-self-improve weekly cron, chad-budget-audit weekly cron,
chad-proposal-apply daily cron with safe-list. Cron-tuning
proposals can land via the gate; anything riskier stays draft-only.
2026-04-30 → 2026-05-04¶
NIM embeddings overlay and gbrain integration. Workflow regression matrix with variants and parallel sub-agent harness. Fitness kind + brain-first pattern. Premium routing with Anthropic side-channel. Per-task budget profiles + model registry. Hybrid Phase-1/Phase-2 cron wrapper infrastructure.
The current state of chad-dev is what the rest of these docs
describe. If anything here feels stale, the source repo is the
canonical reference — please file an issue.