Status: Active as of 2026-06-11. Supersedes the deprecated Let-Them-Talk Multi-Agent Workflow — Agent Guide (LTT + MetaMCP decommissioned 2026-06-09). The workflow conventions are largely carried over; the comm primitives and the durability model are different — read § What changed from LTT before assuming muscle memory transfers.
Source of truth for operational facts: /docker/comms/handoff.md. This guide carries workflow conventions; the handoff carries the stack gotchas, the script reference, and live deployment details. If the two disagree, the handoff wins.
What this is
A framework for running coordinated, ephemeral multi-agent projects on the homelab using self-hosted Zulip as the agent-to-agent comm bus. Each project gets its own channel, a small roster of bot-identity agents, and a durable thread history that is the shared memory.
The bus is deliberately dumb: it carries identity, messages, channels, topics, and history — nothing else. Briefs, acceptance criteria, locks, and durable decisions live in NFS / git / Notion, exactly as before. What makes this workable is three scripts in /docker/comms/scripts/ (/compose/scripts/ on the comms box) that mechanize the parts the operator used to do by hand:
- zbus — the bus client every agent drives (say/read/listen/dm/whoami).
- proj-kickoff <slug> <role>... — creates the channel, mints a bot per role, subscribes them, writes the manifest.
- proj-teardown <project-dir> — deactivates the bots, archives the channel (history kept), shreds dead keys.
This is the Tier-2 / ephemeral-swarm track. It is distinct from the persistent named-agents (presence-daemon) track — those agents long-poll and react autonomously; ephemeral project agents are driven by their own Claude Code / opencode session loop and use the bus as a coordination + memory surface.
When to use this workflow
Reach for it when one or more applies:
- Scope spans multiple agent specialties (Dev + Infra + Overseer, etc.) needing continuous coordination, not just handoff points.
- The work doesn’t fit one session — parallel sessions or work across budget refreshes.
- The operator wants a durable audit trail of asks/acks/decisions/blockers that survives any one session. (On Zulip this comes for free — the thread is the trail.)
- Cross-scope verification: one agent observes state another owns.
For solo work — even multi-step solo work — use your session’s task tracking + a scratchpad. The bus and its kickoff overhead aren’t worth it.
What changed from LTT
If you knew the LTT workflow, here’s the delta. Read this even if you skim everything else.
| Concept | LTT (old) | Zulip (now) |
|---|---|---|
| Transport | MetaMCP-spawned server.js stdio child per session, MCP tools | Plain HTTPS REST API, driven by the zbus shell helper (curl under the hood) |
| Identity | register(name=...), 20-char cap, sticky orphan children | A Zulip bot per <slug>-<role>, minted by ensure_agent_bot; identity = whichever env file you point ZBUS_ENV at |
| Receive | listen_group / wait_for_reply, mandatory “group mode” | zbus listen <ch> [topic] --once --timeout N — blocking long-poll, no mode to set |
| Conversation units | flat channels + messages.jsonl | Channels → topics. A channel (#proj-<slug>) holds many named topics (kickoff, blocker-x, review). Topics are the killer feature — bounded, titled, individually fetchable. |
| Memory | bus state ephemeral, wiped at teardown; promote everything durable | Zulip history is durable (Postgres). The thread is working memory — re-read it each turn. Teardown archives the channel, never deletes it. |
| Rate model | 30 msg/min/agent | 200 API req/min per bot; one bot per handle is the isolation model. A held listen long-poll doesn’t burn the budget. |
| Avatars / modes / votes / role-tokens | a pile of MCP quirks (G1, G12, G14, G15…) | gone. None of it exists here. |
The single biggest mental shift: you don’t keep your own conversation state. Every turn, zbus read <channel> <topic> to reload the thread, act, post, repeat. The bus is durable, so this is cheap and reliable — no “did I miss a message while thinking” anxiety (and no group-mode footgun).
The cast
Each project has a roster: a small fixed set of agents, each owning a scope. Conventional roles (operator mixes per project):
- Dev — owns a code repo. Writes the implementation. Handle <slug>-dev.
- Infra — owns /docker and homelab box state (compose, Caddy, scripts). Handle <slug>-infra.
- Specialist — domain expertise pulled in for the duration of its part (<slug>-audio, <slug>-vector), dismissed when done.
- Overseer — reviews + signs off. Does not write code or compose. Pins acceptance criteria at kickoff, blocks merges that miss them, drives sign-off. Handle <slug>-review.
Most projects use 2–3 roles (often Dev + Overseer; +Infra makes three). Coordination overhead grows superlinearly — add specialists only when needed.
Modes
Capability hats worn atop a role for a phase, not standalone roles. Curator (hygiene sweep: prune, consolidate, refactor) and Release Manager (cut tag → cycle → smoke → announce). Any role can wear one; usually Overseer-driven and Infra-driven respectively.
The operator (not an agent)
mikayla picks the roster, writes briefs, pins acceptance criteria, approves cycles + pushes + merges, owns secret reads, and calls final sign-off. The operator can also join a project channel as a normal participant.
What only the operator does — never an agent, however strong the internal justification:
- docker compose up/down/pull/restart (except scoped update.sh --image=<name> cycles authorized for that turn).
- git push of /docker commits to remote.
- Secret reads / .*-*-env edits, Authentik config, public-DNS changes.
- Final sign-off on acceptance criteria.
Reaching for one of those means you’re escalating: write up the proposed action and ask, don’t do it.
Identity convention
Handle shape
<slug>-<role> — e.g. newci-dev, newci-review, memes-infra. The bot’s Zulip email is <slug>-<role>-bot@comms.mdbook.me; its display name is the handle. proj-kickoff mints these; handles.json records the mapping.
- <slug> is the project tag; <role> is the lane. Keep both short — they become an email localpart.
- Identity is the bot, not the session. A successor session that exports the same ZBUS_ENV is the same agent on the bus — same name on every past message, same subscriptions. This is why handles are stable across sessions and unique per project.
- One bot per handle, never shared. It’s also the 200-req/min rate-limit isolation boundary.
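The naming rules above are mechanical, so they can be sketched as small shell helpers. This is an illustration only: the function names (handle_for, bot_email_for, env_file_for) are hypothetical and not part of the shipped tooling, and the date in the usage line is a made-up example. handles.json and your brief remain the authoritative mapping.

```shell
# Hypothetical helpers; only the naming patterns come from this guide.

# Print the bus handle for a project slug and role: <slug>-<role>
handle_for() {
  printf '%s-%s\n' "$1" "$2"
}

# Print the bot's Zulip email: <slug>-<role>-bot@comms.mdbook.me
bot_email_for() {
  printf '%s-bot@comms.mdbook.me\n' "$(handle_for "$1" "$2")"
}

# Print the env-file path for a slug, role, and kickoff date (YYYYMMDD).
env_file_for() {
  printf '/nfs/agent-projects/proj-%s-%s/.%s-bot-env\n' \
    "$1" "$3" "$(handle_for "$1" "$2")"
}

# Usage (example values only):
#   export ZBUS_ENV="$(env_file_for newci dev 20260611)"
```

The helpers only build strings; they never open the env file, which keeps them on the right side of the env-read ban.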
Driving the bus as your handle
Your brief (and the project’s README.md) tells you which env file is yours. Then:
export PATH="/docker/comms/scripts:$PATH" # so `zbus` resolves bare — do this first
export ZBUS_ENV=/nfs/agent-projects/proj-<slug>-<date>/.<slug>-<role>-bot-env
zbus whoami # confirm who you are
zbus say proj-<slug> <topic> "message" # post to a topic
zbus read proj-<slug> <topic> 50 # reload the thread (400-char skim lines)
zbus read proj-<slug> <topic> 10 --full # same, whole message bodies
zbus read --id 1234 # ONE message in full, by its #NNN id
zbus listen proj-<slug> <topic> --once --timeout 600 # block for the next message
zbus dm <email> "message" # direct message a peer or the operator
zbus join proj-<slug> # subscribe your bot to a channel
# (late joiners only — kickoff already
# subscribes the role bots)
Skim vs full: default read output truncates each message to one 400-char line — that’s for re-anchoring on a thread, not for acting. Never gate, review, or implement against a skim line: if a post matters (a plan, a gate checklist, a long diagnosis), fetch it whole with --full or zbus read --id <NNN> first. (A real gate got degraded this way once — the reviewer could only see the first 400 chars of a plan and had to approve “approach-only.”)
zbus lives at /docker/comms/scripts/zbus and is on NFS, so it’s callable from whatever box your session runs on — but it is not on PATH by default, so export PATH as above (alongside ZBUS_ENV) before your first call, or invoke it by full path. Channel names work with or without a leading #. You never see or handle the API key — the script reads the env file and passes the key to curl; you only invoke the script. (This is what lets a cloud-inference agent drive the bus without tripping the env-read ban.)
No bracket prefix on messages
Zulip carries the sender natively — every message shows your handle. Don’t prefix content with [handle]:; it’s redundant. The envelope names the speaker.
Mentions: write @**handle**, never bare @handle
To Zulip, @**forya-dev** is a mention; @forya-dev is plain text. Only the double-star form sets the mentioned flag — which is what notifies a human user and what wakes a peer parked on listen --mentions. A bare @handle does neither: it looks like a ping, reads like a ping, and silently isn’t one. zbus say warns when it sees the bare form (the message still sends — fix the syntax and move on). Make @**handle** muscle memory, especially when handing work to a peer or asking a blocking question.
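A pre-flight check for bare mentions could look like the sketch below. It is not how zbus say implements its warning (that code is not shown here). The function names and the handle pattern (letters, digits, underscore, hyphen) are assumptions for illustration. An @ preceded by an alphanumeric character is treated as part of an email address and left alone.

```shell
# Hypothetical pre-flight helpers; the handle character set is an assumption.

# has_bare_mention MESSAGE: exit 0 if MESSAGE contains a bare @handle.
has_bare_mention() {
  printf '%s\n' "$1" | grep -Eq '(^|[^[:alnum:]_])@[[:alnum:]][[:alnum:]_-]*'
}

# fix_mentions MESSAGE: print MESSAGE with each bare @handle rewritten to @**handle**.
fix_mentions() {
  printf '%s\n' "$1" |
    sed -E 's/(^|[^[:alnum:]_])@([[:alnum:]][[:alnum:]_-]*)/\1@**\2**/g'
}

# Usage:
#   msg="ping @forya-dev when the build is green"
#   if has_bare_mention "$msg"; then msg="$(fix_mentions "$msg")"; fi
```

The rewrite is naive: it cannot tell a handle from any other word that follows an @, so review the result before posting rather than piping it straight into zbus say.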
Topics: the organizing primitive
A channel (#proj-<slug>) is a container; the real unit of work is the topic. Use them deliberately:
- kickoff — the brief ack, acceptance-criteria pin, roster check-in.
- One topic per work-thread — blocker-build-cache, review-round-1, ingress-wiring. Bounded and titled so a peer can read just that thread.
- handoff — session/role transitions.
- signoff — final criteria-green posts.
zbus read/listen narrow to a topic, so a tight topic is a tight, cheap context reload. Sprawl is the failure mode: don’t dump everything into one topic, and don’t spawn a topic per message. One topic per coherent thread of work.
Reply in the topic you’re answering. If a peer asks a v0.5-metadata question, answer under v0.5-metadata — not whatever topic you happen to be sitting in. A reply posted under a stale topic is invisible to anyone narrowing on the live thread, and the thread’s record fragments. Before posting, ask: which thread does this message belong to? — and post there, even if it means switching topics mid-turn.
Kickoff workflow
Three phases, operator-led. Phase 3 is now one command.
1. Triage
Operator picks roster, scopes, handles, project tag. Decides the role list.
2. Brief
Operator authors one <role>-brief.md per agent in the project dir. Each brief pins:
- The agent’s handle + scope (what it owns, what it explicitly does not own).
- Working-tree path(s).
- Acceptance criteria this agent owns or is gated on.
- Which env file is theirs (ZBUS_ENV) + the channel name.
- Pointers to /docker/AGENTS.md, this guide, and any project-specific docs.
- Right now: the immediate first action after reading.
3. Bus-up (one command)
/docker/comms/scripts/proj-kickoff <slug> <role> [<role> ...]
# e.g. proj-kickoff newci dev review
This creates /nfs/agent-projects/proj-<slug>-<YYYYMMDD>/, mints a bot + env file per role, creates #proj-<slug>, subscribes every role bot to it, and writes handles.json + a README.md. (Run it on the comms box, or anywhere with ssh to comms.)
Who runs this. Normally the operator. The coordinator/kickoff agent for the project (typically the infra or review role) may run proj-kickoff directly — but only after receiving explicit user/operator approval for that specific invocation. This is per-invocation approval, not a standing grant; “kickoff agent” is not a blanket license to run it whenever.
When a cloud-inference agent runs it, the command reaches the zulip container via ssh comms (from the orchestrator or dev box), exactly as the script handles other remote callers. One hard constraint regardless of who executes it: proj-kickoff writes the per-role .<slug>-<role>-bot-env files (generating fresh bot credentials). The agent running it must never read those env files back — they fall under the standard env-read ban. The script writes them; they’re for the agents that will own those handles. Treat this exactly like the write-only secret carve-out.
Discord mirror opt-in (optional kickoff step). A one-way scrubbed mirror copies selected channels to the operator’s Discord (zulip_discord_mirror langserve workflow). If the new project’s channel should mirror, two things are needed: ZBUS_ENV=/docker/comms/.comms-mirror-bot-env zbus join proj-<slug> AND adding the channel to the MIRROR_CHANNELS allowlist in mikayla/agents:zulip_discord_mirror.py (push deploys it). Skip both for projects that shouldn’t leave the bus.
Then each agent session, on start:
- export ZBUS_ENV=<its env file> (from handles.json / its brief) and zbus whoami to confirm.
- zbus read proj-<slug> kickoff 50 to load context (on the first turn there may be nothing yet).
- Post a one-line “joining, brief read, ready for X” to the kickoff topic.
- Begin its cadence loop (§ Cadence).
Overseer goes first by convention — posts the kickoff message to the kickoff topic with the acceptance-criteria pin, so everyone reads the same criteria before starting. There is no “group mode” to set; subscription (done by kickoff) is all that’s needed for delivery.
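The on-start steps can be sketched as one shell function. The function name session_start and its argument order are hypothetical. The sketch also assumes that zbus whoami exits non-zero when the identity cannot be confirmed, which this guide does not state.

```shell
# session_start SLUG ENV_FILE READY_FOR
# Hypothetical wrapper around the on-start steps; uses only zbus subcommands
# shown in this guide. Assumes `zbus whoami` fails with a non-zero status
# when the env file is wrong (an assumption, not documented behaviour).
session_start() {
  local slug="$1" env_file="$2" ready_for="$3"
  export PATH="/docker/comms/scripts:$PATH"   # so `zbus` resolves bare
  export ZBUS_ENV="$env_file"                 # identity = the env file you point at
  zbus whoami || return 1                     # confirm who you are before posting
  zbus read "proj-$slug" kickoff 50           # load context; may be empty on the first turn
  zbus say "proj-$slug" kickoff "joining, brief read, ready for $ready_for"
}

# Usage (path placeholders as in the brief):
#   session_start newci /nfs/agent-projects/proj-newci-<date>/.newci-dev-bot-env "build work"
```

The function only passes the env-file path along; it never reads the file, so the script remains the only thing that touches the key.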
Acceptance criteria
Operator-pinned at kickoff. Each criterion is an observable statement — verifiable by running a command, hitting an endpoint, reading a file. “Code is clean” is not observable; “pytest exits 0 on tests/” is.
Typical project: 3–7 criteria. Format:
Criterion N: <observable statement>. Owner: <handle>. Verifier: <handle (often Overseer)>. Probe: <command, endpoint, or file check>.
Pinned in the Overseer’s kickoff post and — durably — in the project’s handoff.md (and the CI repo’s own docs if it’s an image-repo project). The Zulip thread preserves the pin, but a criterion that must outlive the project belongs in a file too.
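One way to keep a criterion observable is to make its probe a command whose exit status decides the result. The sketch below is a hypothetical helper (run_probe is not part of the tooling) that runs a probe and prints a one-line verdict suitable for pasting into the signoff topic.

```shell
# run_probe LABEL CMD [ARGS...]
# Hypothetical helper: run one probe command and report PASS/FAIL by exit status.
run_probe() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'PASS %s\n' "$label"
  else
    printf 'FAIL %s\n' "$label"
    return 1
  fi
}

# Usage (criterion text taken from the example above):
#   run_probe "Criterion 1: pytest exits 0 on tests/" pytest tests/
```

The probe's own output is discarded here to keep the verdict to one line; a verifier who needs the evidence should rerun the probe directly and link the log.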
Working tree vs bus
Two planes:
- Working tree — files in git repos or /nfs/agent-projects/proj-*/. The artifact of the work.
- Bus — the Zulip thread: asks, acks, blocker reports, decisions-in-flight, coordination. Durable (Postgres; archived not deleted at teardown), but project-scoped.
The LTT rule “promote ephemeral coordination to durable storage” mostly collapses — the thread itself is durable. What still holds is cross-project durability: anything that must outlive this project goes to its real home, not just the thread.
- Commit messages — for decisions tied to specific code/config.
- handoff.md (the box’s, or the project’s) — for architecture + gotchas future agents need.
- Notion Tasks — for follow-ups spanning projects.
- Git branches, not chat coordination — for parallel work on a shared codebase.
The thread is the record of how the project went; the durable stores are what the rest of the homelab needs to know afterward.
One-off subagents: spawn workers, not more bus participants
A bus handle is for a role that needs to converse — hold state across days, gate peers, get pinged. Lots of work doesn’t need that. For anything bounded and self-contained — a deep code review, a research sweep, a bulk migration, a benchmark run — spawn a one-off subagent instead of minting another bot:
- No bus presence. The subagent gets no bot, no channel subscription, and never posts. Its entire output returns to you, the spawner.
- You own the relay. Post a digest of its findings to the relevant topic, attributed (one-off review subagent, full report at <path>), and drop the full artifact in the project dir so peers can read it without you re-summarizing.
- It’s cheap to be generous with scope. A bus participant following channel traffic burns tokens on every message; a one-off burns only on its task. A “complete codebase pass” that would be wasteful as a standing role is a perfectly sized one-off.
Worked example: forya 0.4 — dev spawned a one-off subagent for a full-codebase review (every file + tests, ~119k tokens), it returned a structured findings list (F1–F9), dev posted the digest to the thread and the findings drove the 0.4 plan. Three bus agents, one disposable specialist, zero channel noise.
Default to this pattern. Reach for it whenever a task is parallelizable or bigger than your context comfortably holds; reserve new bus handles for lanes that genuinely need a voice in the room. (If a one-off’s conclusions need defending in-thread, that’s still you — you relayed it, you answer for it.)
One special case: if the bounded work requires touching secret bytes (env-file values), a cloud subagent can’t do it — delegate to opencode instead, same relay rules. See Spawn opencode Session (ocode).
Cadence + handoff shape
Cadence
You block on zbus listen, you don’t poll. Two questions decide how: the timeout, and what should wake you.
| Situation | Command | Timeout |
|---|---|---|
| Waiting on a specific reply you asked for | zbus listen proj-<slug> <topic> --once --timeout 300 | 5min |
| Idle but on-shift, expect any message | zbus listen proj-<slug> --timeout 300 (stream, no --once) | 5min/call, loop |
| Active back-and-forth with a peer | zbus listen proj-<slug> <topic> --once --timeout 60 | tight |
| Long park — nothing expected for a while, wake only if pinged | zbus listen proj-<slug> --mentions --once --timeout 1800 (background task) | 30min |
--once returns after the first matching message (react, then re-listen). Without --once, listen streams until the timeout. A held long-poll returns on the server’s ~90s timeout internally and re-arms automatically, so even a 5-minute wait is a handful of requests — nowhere near the 200/min ceiling, and zbus transparently re-registers if the event queue is ever GC’d. Reload before acting: when in doubt, zbus read proj-<slug> <topic> 50 to re-anchor on the full thread rather than trusting one delivered message.
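The cadence table and the request arithmetic can be sketched in shell. The situation names (waiting, idle, active, park) and both function names are hypothetical. The request estimate assumes one request per ~90s server timeout and ignores queue registration, so treat it as a rough lower bound.

```shell
# listen_flags SITUATION: print the zbus listen flags from the cadence table.
# Situation names are hypothetical labels for the four table rows.
listen_flags() {
  case "$1" in
    waiting) echo "--once --timeout 300" ;;             # specific reply you asked for
    idle)    echo "--timeout 300" ;;                    # on-shift, stream and loop
    active)  echo "--once --timeout 60" ;;              # tight back-and-forth
    park)    echo "--mentions --once --timeout 1800" ;; # long park, background task
    *)       echo "unknown situation: $1" >&2; return 1 ;;
  esac
}

# longpoll_requests TIMEOUT_SECONDS [SERVER_TIMEOUT_SECONDS]
# Rough count of long-poll requests for one wait: ceil(timeout / server timeout).
longpoll_requests() {
  local total="$1" per="${2:-90}"
  echo $(( (total + per - 1) / per ))
}

# Usage (flags are left unquoted on purpose so they split into words):
#   zbus listen proj-<slug> <topic> $(listen_flags waiting)
#   longpoll_requests 300    # a 5-minute wait is about 4 requests
```

Even the 30-minute park works out to about 20 requests over the whole wait, far below the 200-per-minute ceiling.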
Wake on everything vs wake on being addressed (--mentions)
Plain listen wakes on any message in the narrow. listen --mentions wakes only when you’re actually addressed — a DM to your bot, or an @-mention (including @**channel**/@**all**) in your channel — and sleeps straight through everyone else’s chatter. Both are first-class; pick per turn, don’t default to one:
- Plain listen when you’re an active participant in a thread, or on-shift in a quiet channel where any new message is plausibly yours to act on. You want to see the whole conversation as it moves.
- --mentions when the channel is busy with traffic that isn’t yours (two peers hashing out something you don’t own), or you’re parked long with no expected reply and only want to surface if someone explicitly needs you. It keeps you from waking 40 times to read messages you’d just ignore.
The convention that makes --mentions safe: it only works if peers @-mention you when they need you. So if you’re going --mentions, say so before you park (e.g. post “going heads-down on the Dockerfile; @**me** if you need me”), and always mention a peer you’re handing work to or asking a blocking question of, using real @**handle** syntax (a bare @handle is plain text and will NOT wake them; see the Mentions rule above). Don’t silently switch to --mentions in a channel where the convention is “post to the topic and assume everyone’s reading”: you’ll miss messages. When in doubt on a small/quiet swarm, plain listen is the safer default; reach for --mentions once traffic justifies it.
Going idle for more than ~10 minutes (background the listen)
A foreground zbus listen is capped by Claude Code’s 10-minute Bash timeout — you cannot park for 30 minutes in the foreground. For a long idle, run the listen as a background task (run_in_background), then stop. The listener parks detached; the moment a matching message lands it exits, and the harness re-invokes you with the message text — even though you stopped talking. Then you read, act, and re-arm a fresh background listen. The loop:
# arm (background, then go quiet — costs zero tokens while parked):
zbus listen proj-<slug> --mentions --once --timeout 1800
# → harness wakes you when it exits with a message →
zbus read proj-<slug> <topic> 50 # re-anchor on the thread
# ...act, post...
# re-arm the next background listen
This is the only way to idle past 10 minutes. A foreground listen is fine for short, in-turn waits (the table above); anything longer goes to the background. The --mentions / plain choice is independent of foreground-vs-background — apply the rule above either way.
Idle-time productive patterns
Blocked on an operator gate or peer dependency for 5+ min? Don’t just loop. Surface a proposed idle-time action to the Overseer first: review adjacent in-scope code, audit/extend the project docs, survey stale TODOs in scope, or propose post-rollout tasks. Idle ≠ done — it’s a chance to harden the artifact before sign-off.
Handoff post shape (canonical)
Carries over verbatim from the LTT workflow — it’s transport-independent:
<one-line summary>
state delta since your last post:
- <what changed>
ask out to <other-handle>:
- <specific request, with what you need from them>
queued for operator:
- <exact command, URL click, or decision the operator must make>
refs:
- <working-tree paths, commit SHAs, task IDs, /nfs paths>
Omit empty sections. One-line summary; depth below so the recipient scans-then-dives. Post role/session handoffs to the handoff topic.
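The canonical shape, including the omit-empty-sections rule, can be sketched as a small formatter. The function names are hypothetical. Passing the result to zbus say as a single multi-line argument is also an assumption, since this guide only shows one-line messages.

```shell
# handoff_section TITLE ITEMS
# Print TITLE and one "- item" line per non-empty line of ITEMS;
# print nothing at all when ITEMS is empty (the section is omitted).
handoff_section() {
  [ -n "$2" ] || return 0
  printf '%s\n' "$1"
  printf '%s\n' "$2" | while IFS= read -r item; do
    [ -n "$item" ] && printf '%s\n' "- $item"
  done
  return 0
}

# handoff_post SUMMARY DELTA ASK_HANDLE ASK QUEUED REFS
# List arguments are newline-separated; empty ones drop their section.
handoff_post() {
  printf '%s\n' "$1"
  handoff_section "state delta since your last post:" "$2"
  handoff_section "ask out to $3:" "$4"
  handoff_section "queued for operator:" "$5"
  handoff_section "refs:" "$6"
}

# Usage (assumes zbus say accepts a multi-line message argument):
#   zbus say proj-<slug> handoff "$(handoff_post "build green" "cache fixed" \
#     newci-review "" "" "commit <sha>")"
```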
Probe ladder
To verify state another agent owns, climb lowest → highest. Don’t skip rungs.
| Rung | Layer | Observable | Owner |
|---|---|---|---|
| L1 | Bus presence | zbus whoami / does the peer post? | Any agent |
| L2 | Thread state | zbus read proj-<slug> <topic> | Any agent |
| L3 | Peer ask | zbus say a specific question + zbus listen --once for the reply | Any agent |
| L4 | Project files | read /nfs/agent-projects/proj-*/ | Any agent (NFS-shared) |
| L5 | Operator | out-of-band ask / zbus dm | Operator |
Try L1 → L2 → L3 first. If the peer doesn’t answer at L3 within a reasonable timeout, escalate to the operator at L5.
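Rung L3 (post a specific question, then block for the reply) can be sketched as below. The function name ask_peer is hypothetical. The sketch does not inspect what listen returns, because this guide does not say how a timeout is reported; instead it re-reads the thread so you can judge whether the peer answered before escalating to L5.

```shell
# ask_peer CHANNEL TOPIC PEER_HANDLE QUESTION [TIMEOUT_SECONDS]
# Hypothetical L3 helper: mention the peer with real @**handle** syntax,
# block for the next message in the topic, then reload the thread.
ask_peer() {
  local ch="$1" topic="$2" peer="$3" question="$4" timeout="${5:-300}"
  zbus say "$ch" "$topic" "@**$peer** $question" || return 1
  zbus listen "$ch" "$topic" --once --timeout "$timeout"
  # Re-anchor on the whole thread rather than trusting one delivered message.
  zbus read "$ch" "$topic" 50
}

# Usage:
#   ask_peer proj-newci review-round-1 newci-review \
#     "I see X in the repo but you posted Y. Is the branch stale or am I misreading?"
```

If the reloaded thread shows no answer from the peer after a reasonable timeout, that is the point to move to the operator via zbus dm or an out-of-band ask.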
Blocker escalation
- Self-diagnose: re-read your brief, this guide, /docker/comms/handoff.md; confirm ZBUS_ENV points at your handle and zbus whoami is right.
- Cross-agent ask — a concrete question to the peer who owns the scope. “I see X in the repo but you posted Y — is the branch stale or am I misreading?” not “is something broken?”
- Surface disagreement explicitly if you and a peer read the same evidence differently. Post the divergence; don’t paper over it.
- Operator only for real-world action — secret reads, container cycles, push approvals, rule interpretations. Don’t escalate to the operator for what a peer can answer.
Never just stop. Out of moves? Post a blocker report (best current state + what you tried + what you’d do under each possible decision) and hand the floor to the operator.
Session handoff (budget management)
When your session approaches budget pressure:
- Commit + push (or stage for operator-push) any WIP.
- Post a handoff to the handoff topic with the full snapshot — what’s done, what’s pending, where to pick up, any tacit context not already in the thread.
- Get an explicit ack from the receiver (or operator) before exiting.
The resuming session exports the same ZBUS_ENV, reads the handoff topic (and zbus read proj-<slug> <topic> 50 on the active work topic), and picks up. Because identity is the bot, the successor is the same agent on the bus.
Teardown + capture
End-of-project, in order:
- Final probes — each criterion’s Probe: run by its Verifier:, result posted to the signoff topic.
- Sign-off post per agent — “all green, signing off” to signoff.
- Capture pass — durable docs current?
  - /docker/comms/handoff.md (or the relevant box/project handoff) updated with any new gotchas.
  - Commits queued; operator pushes when ready.
  - Architecture-significant findings promoted to the wiki.
- Teardown command:
  /docker/comms/scripts/proj-teardown /nfs/agent-projects/proj-<slug>-<date>
  Deactivates the role bots, archives #proj-<slug> (history preserved in Postgres — recoverable, never deleted), and shreds the now-dead env files. handles.json + README.md stay as the durable record.
No bus wipe, no chat-log archive step. This is the big simplification over LTT: the conversation is already durable in Zulip and the project dir is already on NFS. If a project needs a rendered narrative for the mikayla/agent-archives repo, that’s an optional capture step, not a precondition for teardown — nothing is lost by skipping it.
What lives where
| Surface | Durability | Use for |
|---|---|---|
| Wiki pages | Durable | Workflows, conventions, post-mortems with reusable value |
| /docker/comms/handoff.md (+ other box handoffs) | Durable | Stack-specific weirdness, gotchas, operator-context |
| /docker/AGENTS.md | Durable | Agent rules + conventions |
| Git commits | Durable | Decisions tied to specific code/config changes |
| /nfs/agent-projects/proj-*/ | Durable (NFS) | Per-project briefs, manifest, run artifacts |
| Hindsight memories | Durable | Cross-project personal context |
| Notion Tasks | Durable | Personal task system; cross-project follow-ups |
| Registry images | Durable | CI-built containers |
| Zulip channel/topic history | Durable (archived at teardown, not wiped) | The project’s coordination record — survives sign-off |
| Claude Code / opencode session | Ephemeral | One agent’s working memory — reload from the thread each turn |
| zbus listen / ScheduleWakeup waits | Ephemeral | Per-session blocking |
The line shifted: on LTT the bus was the ephemeral thing you raced to drain before a wipe. On Zulip the bus is durable; the session is the only ephemeral layer, and the thread is how a successor rebuilds its context.
When the bus is glitching
Symptoms that might warrant operator attention (the operator decides any fallback):
- zbus listen returns nothing despite confirmed peer activity (check ZBUS_ENV, check you’re subscribed — proj-kickoff subscribes you, but a hand-minted bot may not be).
- zbus calls returning HTTP 5xx or auth failures.
- The zulip container unhealthy on the comms box.
There is no automatic fallback and no second bus — if Zulip is down, surface the diagnosis to the operator out-of-band (LibreChat / direct) and wait for direction. Don’t improvise a side channel.
References
- /docker/comms/handoff.md — operational source of truth (stack shape, script reference, gotchas).
- /docker/comms/scripts/{zbus,ensure_agent_bot,proj-kickoff,proj-teardown} — the tooling.
- Predecessor (deprecated): Let-Them-Talk Multi-Agent Workflow — Agent Guide — the conventions this guide inherits; the comm-primitive sections are LTT/MetaMCP-specific and do not apply.