Companion to: Zulip Multi-Agent Workflow — Agent Guide (§ When the bus is glitching).
When to read this: when the Zulip bus is healthy you don’t need this guide. Read it once for awareness; come back only when fallback has actually been triggered.
Hard rule: a worker never switches to chat.log fallback unilaterally. The operator triggers fallback. The one exception is the persistent-swarm / AFK case below, and even there it’s the governor’s considered call, not any worker’s. If you think the bus is broken, post a diagnosis (confirm it’s not just your own `ZBUS_ENV` / subscription), surface to the operator out-of-band, and wait for direction.
Lineage: this supersedes the retired Let-Them-Talk Chatlog Fallback (LTT + MetaMCP decommissioned 2026-06-09). The mechanism (an append-only NFS `chat.log` watched by a per-agent hook) carries over unchanged; the bus-specific bits are rewritten for Zulip.
What this is
A degraded-mode comm pattern that uses an NFS-shared append-only chat.log file as the coordination surface when the Zulip bus is unreliable or down. It’s the answer to “the bus is dead, the operator is asleep, and the swarm still has work in flight” — keep agents coordinating over a flat file until the bus is back.
Chat.log is intentionally lossy compared to the bus:
- No parked long-poll: agents can’t `zbus listen` and block; they poll the file via a hook or re-read on demand.
- No `@**handle**` phone-push: nothing notifies the operator’s phone; they (or the governor) read the file directly.
- No topics, no Discord mirror, no durable Zulip thread: one flat file is the whole surface, and nothing leaves NFS while fallback is active.
Accept the degradation. The goal is to keep agents coordinating until the bus is back — not to rebuild the bus on top of a file.
When to use it
Fallback is warranted when the bus is genuinely untrustworthy and the standard checks haven’t fixed it (see the main guide § When the bus is glitching):
- `zbus` calls returning sustained HTTP 5xx / auth failures, or the `zulip` container unhealthy on the comms box with no quick recovery.
- A bus cycle that keeps dropping parked listeners faster than agents can re-arm.
- Silent delivery failures that persist after confirming `ZBUS_ENV`, subscription, and a repost.
Symptoms that AREN’T fallback triggers (fix these in place, don’t switch surfaces):
- One agent seeing nothing: check your `ZBUS_ENV` and that you’re subscribed (`proj-kickoff` subscribes you; a hand-minted bot may not be).
- A single dropped high-stakes post: `zbus read` to confirm, repost if missing.
- A deaf parked listener after one server restart: just re-arm your `listen`.
Who triggers it
Default: the operator. Fallback is a surface switch for the whole swarm; one agent deciding it unilaterally splits the swarm across two surfaces, which is worse than a slow bus. So a worker never flips it on its own — it surfaces the diagnosis and waits.
Exception — persistent swarm, operator unreachable. On a long-running, partly-unattended swarm (see the main guide § Long-running / persistent swarms), the governor may, under careful consideration, direct fallback — but only when both hold:
- The operator is genuinely unreachable (asleep / mobile / out — not merely slow to answer), and
- The bus is confirmed dead and won’t recover in the window the work needs, with no operator available to fix it.
Even then it’s a deliberate, announced, logged decision — the governor posts the switch marker (below), names the reason, and the swarm reverts to operator authority the moment they’re back. This is a narrow evolution of the old “only the operator triggers fallback” rule for the AFK case; the core still holds — no worker flips it alone, and the governor doesn’t flip it just because the operator is slow.
Setup
Run by whoever triggers it (operator, or governor under the exception):
- Create the project chat.log if it doesn’t exist:
    touch /nfs/agent-projects/proj-<slug>-<date>/chat.log
- Confirm the roster is durably recorded: the brief / `handoff.md` lists every agent’s handle, so anyone reading chat.log can attribute and address posts without the bus.
- Drop the switch marker:
  - On the bus (last gasp, if it’s limping):
      zbus send <channel> <topic> "Switching to chat.log fallback. Reason: <one-liner>. Canonical surface: /nfs/agent-projects/proj-<slug>/chat.log. Install your watcher hook."
  - In chat.log (first entry, append-only): see § Post shape.
- Each agent installs the watcher hook in its session (see § Hook installation).
- Each agent posts a “joined fallback” entry in chat.log.
From here, chat.log is canonical. The bus may still be limping; ignore it until the operator (or governor) declares it restored.
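The chat.log side of the setup steps can be sketched as a small helper. This is a minimal sketch, not the swarm's actual tooling: the function name `start_fallback` and its arguments are hypothetical, and it covers only creating the file and appending the first marker entry. The `zbus send` announcement and the hook installation stay manual, as described above.

```shell
#!/bin/sh
# Hypothetical helper: create chat.log if missing and append the switch marker
# as the first fallback entry. Appends only; existing content is never touched.
# Usage: start_fallback <project-dir> <handle> <reason>
start_fallback() {
    proj_dir=$1
    handle=$2
    reason=$3
    chatlog="$proj_dir/chat.log"

    # touch creates the file only if it does not exist yet.
    touch "$chatlog" || return 1

    # %z (numeric UTC offset) is supported by GNU and BSD date.
    {
        printf '=== %s | %s ===\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$handle"
        printf 'Switching to chat.log fallback. Reason: %s.\n' "$reason"
        printf 'Canonical surface: %s. Install your watcher hook.\n' "$chatlog"
        printf '%s\n' '---'
    } >> "$chatlog"
}
```

Calling it twice is harmless: the second call appends a second marker rather than replacing the first, which matches the append-only rule in the next section.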
Append-only hygiene
Use >> (append). Never > (overwrite). In an LTT-era project a stray > instead of >> truncated the entire chat.log mid-run; the lineage was only recovered because one agent still had the prior context in session scrollback. Don’t repeat that — it’s the one mistake this pattern can’t tolerate.
Safe append idioms:
# Heredoc append
cat >> /nfs/agent-projects/proj-<slug>/chat.log <<'EOF'
=== 2026-06-14 14:30 EDT | <slug>-infra ===
<body>
---
EOF
# Single-line append
printf '%s\n' "=== ... ===" >> /nfs/agent-projects/proj-<slug>/chat.log
# tee — only with -a
some-command | tee -a /nfs/agent-projects/proj-<slug>/chat.log
Never: `> chat.log` (truncates) · `tee chat.log` (truncates; tee defaults to overwrite) · opening the file in an editor that can clobber on save.
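One extra guard that may be worth setting in an agent's shell: the POSIX `noclobber` option (`set -C`) makes a plain `>` fail when the target file already exists, while `>>` keeps working. The sketch below shows the effect on a throwaway temp file, not on a real chat.log. It does not protect against `tee` without `-a` or against editors, so it supplements the rule rather than replacing it.

```shell
#!/bin/sh
# Demonstrates noclobber (set -C): '>' onto an existing file fails,
# '>>' still appends. Uses a temp file so it is safe to run anywhere.
log=$(mktemp)
printf '%s\n' 'first entry' >> "$log"

set -C   # noclobber on for the rest of this shell session

# A stray '>' now errors out instead of truncating the log.
if ( printf '%s\n' 'oops' > "$log" ) 2>/dev/null; then
    clobber_result=truncated
else
    clobber_result=blocked
fi

# Appending is unaffected by noclobber.
printf '%s\n' 'second entry' >> "$log"

echo "stray '>': $clobber_result"
echo "lines in log: $(wc -l < "$log" | tr -d ' ')"
```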
Truncation recovery
- Stop appending immediately.
- Check whether any agent’s session still has the prior contents in scrollback — they can paste-recover.
- If recovered: append it back under a clear `--- RECOVERED: TRUNCATED AT <byte>, RESTORED FROM <handle>'s SESSION ---` separator.
- If not: the lost portion is gone; post a “chat.log truncated at
The append-only rule is the load-bearing one. Everything else is style.
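The "append it back" recovery step could look roughly like the following. The helper name `restore_recovered` and its arguments are hypothetical; it assumes the recovered text has already been pasted from scrollback into a separate file. It only appends, so whatever was written after the truncation stays in place above the restored block.

```shell
#!/bin/sh
# Hypothetical helper: append recovered content back under the RECOVERED
# separator. The truncated chat.log is never rewritten, only appended to.
# Usage: restore_recovered <chatlog> <recovered-file> <truncation-byte> <handle>
restore_recovered() {
    chatlog=$1
    recovered=$2
    byte=$3
    handle=$4

    [ -r "$recovered" ] || { echo "cannot read $recovered" >&2; return 1; }

    {
        printf '%s\n' "--- RECOVERED: TRUNCATED AT $byte, RESTORED FROM $handle's SESSION ---"
        cat "$recovered"
    } >> "$chatlog"
}
```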
Post shape
Every entry is a self-contained block separated by ---. The header names the timestamp + handle; the body follows the canonical handoff shape from the main guide (§ Cadence + handoff shape). The bus carries identity in its envelope; a flat file doesn’t, so the header does that job.
=== <ISO timestamp with timezone> | <handle> ===
<one-line summary>
state delta since your last post:
- <what changed>
ask out to <other-handle>:
- <specific request>
queued for operator:
- <exact command or click sequence>
refs:
- <working-tree paths, commit SHAs, /nfs paths>
---
Notes:
- Timestamp the header, not the body lines — future readers only need block-level chronology.
- Empty subsections can be omitted. Keep it clean.
- No length limit (unlike a Discord mirror’s char ceiling) — long bodies use the same shape.
- Bus-era discipline still applies in-file: a `STATE ANCHOR` block (shipped / in-flight / decided / next) re-posted as the run moves, latest wins; a `LOCKED:` block supersedes its own back-and-forth. They’re just chat.log entries now instead of pinned posts.
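Because "latest wins" means a reader has to find the most recent anchor in a growing file, a small extractor can help. This is a sketch under two assumptions that the guide does not spell out: every block starts with a `=== ... ===` header line and ends with a line that is exactly `---`, and an anchor block contains the literal text `STATE ANCHOR` somewhere in its body. The function name `latest_state_anchor` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the most recent chat.log block that mentions
# STATE ANCHOR. Assumes '=== ... ===' opens a block and a bare '---' closes it.
# Usage: latest_state_anchor <chatlog>
latest_state_anchor() {
    awk '
        # A header line starts a fresh block.
        /^=== .* ===$/ { block = ""; in_block = 1; has_anchor = 0 }
        # Accumulate every line of the current block, header included.
        in_block {
            block = block $0 "\n"
            if (index($0, "STATE ANCHOR")) has_anchor = 1
        }
        # The separator closes the block; later anchors overwrite earlier ones.
        in_block && $0 == "---" {
            if (has_anchor) latest = block
            in_block = 0
        }
        END { printf "%s", latest }
    ' "$1"
}
```

It prints nothing if no anchor has been posted yet, which a caller can treat as "no anchor, read the whole file".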
Hook installation
Each agent installs a hook that auto-detects chat.log changes, computes the delta since it last looked, and injects it as additionalContext so the agent reads it on the next turn. It runs on UserPromptSubmit (primary) and SessionStart (boot catch-up).
Only install the hook when fallback is active. It’s pure noise during normal bus operation.
Put it in the agent’s CLAUDE_CONFIG_DIR/settings.json if the swarm uses per-agent config dirs (see the mechanics page — then it rides with the identity), or in the repo’s .claude/settings.local.json otherwise:
{
  "hooks": {
    "UserPromptSubmit": [
      {
        "hooks": [
          { "type": "command", "command": "/nfs/agent-projects/proj-<slug>/.<handle-slug>-watch-chatlog.sh" }
        ]
      }
    ],
    "SessionStart": [
      {
        "hooks": [
          { "type": "command", "command": "/nfs/agent-projects/proj-<slug>/.<handle-slug>-watch-chatlog.sh" }
        ]
      }
    ]
  }
}
Replace `<slug>` with the project slug and `<handle-slug>` with a filesystem-safe form of the handle (e.g. `forya-dev`).
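Before relying on the hook, it can be worth checking that the settings file and the script agree. The preflight below is a hedged sketch: `check_hook_installed` is a hypothetical name, and it uses a plain fixed-string grep rather than parsing JSON, so it only proves that the script path appears in the settings file and that the script is executable. It does not validate the hook schema.

```shell
#!/bin/sh
# Hypothetical preflight: confirm the watcher script is executable and that the
# settings file references its path. No JSON parsing; a path match only.
# Usage: check_hook_installed <settings-file> <watcher-script>
check_hook_installed() {
    settings=$1
    script=$2

    [ -x "$script" ] || { echo "not executable: $script" >&2; return 1; }
    [ -f "$settings" ] || { echo "missing settings: $settings" >&2; return 1; }

    # -F: treat the path as a literal string, not a regex.
    grep -qF -- "$script" "$settings" || {
        echo "$settings does not reference $script" >&2
        return 1
    }
}
```

A non-zero exit with a message on stderr points at the usual failure modes: a forgotten `chmod +x`, or a path in the settings file that still contains an unsubstituted placeholder.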
Why UserPromptSubmit (not Stop)
UserPromptSubmit reliably injects hookSpecificOutput.additionalContext into the next turn. Stop does not inject additionalContext in the current harness. SessionStart only fires on boot — good for “what did I miss while gone,” but partial. The recipe: UserPromptSubmit (primary) + SessionStart (boot catch-up); skip Stop.
Watcher script template
POSIX shell. Save as /nfs/agent-projects/proj-<slug>/.<handle-slug>-watch-chatlog.sh and chmod +x. Each agent uses its own handle-keyed cursor file — prevents two agents fighting over one cursor.
#!/bin/sh
# UserPromptSubmit/SessionStart hook: print any new chat.log content since the
# last check, emitted as hookSpecificOutput.additionalContext so Claude reads it
# on the next turn. Cursor (byte-offset of last-seen end) lives next to the
# chat.log, scoped by handle so each agent has its own.
set -eu
CHATLOG=/nfs/agent-projects/proj-<slug>/chat.log
CURSOR=/nfs/agent-projects/proj-<slug>/.<handle-slug>-chatlog.cursor
[ -f "$CHATLOG" ] || exit 0
hook_input=$(cat)
event=$(printf '%s' "$hook_input" | jq -r '.hook_event_name // "UserPromptSubmit"' 2>/dev/null || echo UserPromptSubmit)
size=$(wc -c <"$CHATLOG" | tr -d ' ')
prev=$(cat "$CURSOR" 2>/dev/null || echo 0)
case "$prev" in ''|*[!0-9]*) prev=0 ;; esac
# Cursor ahead of file (log was truncated/replaced) -> reset.
if [ "$prev" -gt "$size" ]; then prev=0; fi
if [ "$prev" -eq "$size" ]; then exit 0; fi
new=$(tail -c +"$((prev + 1))" "$CHATLOG")
printf '%s' "$size" > "$CURSOR"
jq -n \
--arg evt "$event" \
--arg ctx "chat.log appended ($prev -> $size bytes) since last check:
$new" \
'{hookSpecificOutput: {hookEventName: $evt, additionalContext: $ctx}}'
Substitute `<slug>` and `<handle-slug>` per agent.
Cursor file properties
- Lives at `/nfs/agent-projects/proj-<slug>/.<handle-slug>-chatlog.cursor`; holds one integer (byte offset of last-seen end).
- Auto-resets to 0 if the file was truncated (`prev > size` → reset).
- Per-handle scoping prevents cursor races. Benign to delete: the next hook fire re-reads the whole file.
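The cursor behaviour is easier to see in isolation, without the hook JSON. The sketch below is not the watcher script itself: `read_delta` is a hypothetical function, and it adds one optional hardening, capping the read with `head -c` so that bytes appended while it runs are left for the next call instead of being read twice. `head -c` is available in GNU and BSD userlands but is an assumption about the agent hosts.

```shell
#!/bin/sh
# Sketch of the cursor logic on its own. The size is read once and the delta
# is capped to it, so the stored cursor always matches what was printed.
# Usage: read_delta <chatlog> <cursor-file>   (function name is hypothetical)
read_delta() {
    chatlog=$1
    cursor=$2

    size=$(wc -c < "$chatlog" | tr -d ' ')
    prev=$(cat "$cursor" 2>/dev/null || echo 0)
    # Anything that is not a plain non-negative integer counts as "no cursor".
    case "$prev" in ''|*[!0-9]*) prev=0 ;; esac

    # Cursor ahead of the file means the log was truncated or replaced.
    if [ "$prev" -gt "$size" ]; then prev=0; fi
    if [ "$prev" -eq "$size" ]; then return 0; fi

    # tail -c +N is 1-indexed, so the first unseen byte is prev + 1.
    tail -c +"$((prev + 1))" "$chatlog" | head -c "$((size - prev))"
    printf '%s' "$size" > "$cursor"
}
```

Run against a scratch file: the first call prints everything and stores the size, a second call with no new appends prints nothing, and a call after the file shrinks re-reads from byte 0.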
Cadence
The hook is event-driven — no polling schedule. It fires on every UserPromptSubmit (basically every turn) and on SessionStart. To manually re-check between turns:
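tail -c +"$(( $(cat /nfs/agent-projects/proj-<slug>/.<handle-slug>-chatlog.cursor) + 1 ))" /nfs/agent-projects/proj-<slug>/chat.log
Rarely needed; the hook covers it. The `+ 1` matters because `tail -c +N` is 1-indexed: the cursor holds the last-seen byte offset, so the first unseen byte is one past it.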
What chat.log can’t do
| Bus feature | Fallback substitute |
|---|---|
| `zbus listen` parked long-poll | None; the watcher hook injects deltas each turn |
| `@**handle**` phone-push mention | None; operator/governor reads chat.log directly, no push to phone |
| Topics / separate threads | Sections within the one file (header tags); discipline replaces structure |
| Pinned `STATE ANCHOR` / `LOCKED:` supersede | Re-posted as chat.log blocks; latest wins |
| Discord mirror | Off; nothing leaves NFS during fallback |
| Durable Zulip thread | The append-only file is the durable record, which is why append-only hygiene is load-bearing |
You’re buying time until the bus is back. Accept the degradation; don’t rebuild bus features on top of the file.
Migration back to the bus
When the operator (or governor, under the exception) declares the bus restored:
- Drop a switch-back marker in chat.log:
    === <timestamp> | <governor-or-OPERATOR> ===
    Bus restored (root cause: <one-liner>). Switching back.
    Final chat.log range used during fallback: <start-byte>-<end-byte>.
    Agents: remove the watcher hook, re-arm `zbus listen`, resume bus cadence.
    ---
- Each agent removes the hook from its `settings.json` / `settings.local.json` so it stops injecting chat.log deltas.
- Each agent re-arms `zbus listen` per the main guide’s bus-up sequence.
- First bus post by each agent references the chat.log range used in fallback, which gives the bus history a pointer back to the gap lineage.
- chat.log stays in NFS as project lineage — don’t delete it; it’s the durable record of the fallback period.
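The switch-back marker needs a byte range, which is easy to get wrong by hand. A possible helper is sketched below; `post_switch_back` is a hypothetical name, the start byte is something the caller must have noted when fallback began (0 if the file was created for this fallback), and the end byte is taken as the file size just before the marker itself is appended. Whether the range should include the marker is not specified in this guide, so treat that choice as an assumption.

```shell
#!/bin/sh
# Hypothetical helper: append the switch-back marker with the fallback byte
# range filled in. End byte = file size before this marker is written.
# Usage: post_switch_back <chatlog> <handle> <root-cause> <start-byte>
post_switch_back() {
    chatlog=$1
    handle=$2
    cause=$3
    start=$4

    end=$(wc -c < "$chatlog" | tr -d ' ')

    {
        printf '=== %s | %s ===\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$handle"
        printf 'Bus restored (root cause: %s). Switching back.\n' "$cause"
        printf 'Final chat.log range used during fallback: %s-%s.\n' "$start" "$end"
        printf '%s\n' 'Agents: remove the watcher hook, re-arm `zbus listen`, resume bus cadence.'
        printf '%s\n' '---'
    } >> "$chatlog"
}
```

The same `<start>-<end>` pair is what each agent's first bus post would reference, so it helps to copy it from the marker rather than recompute it after further appends.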
References
- Main workflow: Zulip Multi-Agent Workflow — Agent Guide (§ When the bus is glitching, § Long-running / persistent swarms).
- Per-agent config / hooks-with-identity: Persistent-swarm mechanics.
- Predecessor (retired, historical): Let-Them-Talk Chatlog Fallback — same mechanism, LTT-era bus specifics.
- Hook event semantics: Claude Code hook documentation + `settings.json` schema.