Daemon/Boot/Deacon Watchdog Chain
Autonomous health monitoring and recovery in Gas Town.
Overview
Section titled “Overview”Gas Town uses a three-tier watchdog chain for autonomous health monitoring:
Daemon (Go process) ← Dumb transport, 3-min heartbeat │ └─► Boot (AI agent) ← Intelligent triage, fresh each tick │ └─► Deacon (AI agent) ← Continuous patrol, long-running │ └─► Witnesses & Refineries ← Per-rig agentsKey insight: The daemon is mechanical (can’t reason), but health decisions need intelligence (is the agent stuck or just thinking?). Boot bridges this gap.
Design Rationale: Why Two Agents?
Section titled “Design Rationale: Why Two Agents?”The Problem
Section titled “The Problem”The daemon needs to ensure the Deacon is healthy, but:
-
Daemon can’t reason - It’s Go code following the ZFC principle (don’t reason about other agents). It can check “is session alive?” but not “is agent stuck?”
-
Waking costs context - Each time you spawn an AI agent, you consume context tokens. In idle towns, waking Deacon every 3 minutes wastes resources.
-
Observation requires intelligence - Distinguishing “agent composing large artifact” from “agent hung on tool prompt” requires reasoning.
The Solution: Boot as Triage
Section titled “The Solution: Boot as Triage”Boot is a narrow, ephemeral AI agent that:
- Runs fresh each daemon tick (no accumulated context debt)
- Makes a single decision: should Deacon wake?
- Exits immediately after deciding
This gives us intelligent triage without the cost of keeping a full AI running.
Why Not Merge Boot into Deacon?
Section titled “Why Not Merge Boot into Deacon?”We could have Deacon handle its own “should I be awake?” logic, but:
- Deacon can’t observe itself - A hung Deacon can’t detect it’s hung
- Context accumulation - Deacon runs continuously; Boot restarts fresh
- Cost in idle towns - Boot only costs tokens when it runs; Deacon costs tokens constantly if kept alive
Why Not Replace with Go Code?
Section titled “Why Not Replace with Go Code?”The daemon could directly monitor agents without AI, but:
- Can’t observe panes - Go code can’t interpret tmux output semantically
- Can’t distinguish stuck vs working - No reasoning about agent state
- Escalation is complex - When to notify? When to force-restart? AI handles nuanced decisions better than hardcoded thresholds
Session Ownership
Section titled “Session Ownership”| Agent | Session Name | Location | Lifecycle |
|---|---|---|---|
| Daemon | (Go process) | ~/gt/daemon/ | Persistent, auto-restart |
| Boot | gt-boot | ~/gt/deacon/dogs/boot/ | Ephemeral, fresh each tick |
| Deacon | hq-deacon | ~/gt/deacon/ | Long-running, handoff loop |
Critical: Boot runs in gt-boot, NOT hq-deacon. This prevents Boot
from conflicting with a running Deacon session.
Heartbeat Mechanics
Section titled “Heartbeat Mechanics”Daemon Heartbeat (3 minutes)
Section titled “Daemon Heartbeat (3 minutes)”The daemon runs a heartbeat tick every 3 minutes:
func (d *Daemon) heartbeatTick() { d.ensureBootRunning() // 1. Spawn Boot for triage d.checkDeaconHeartbeat() // 2. Belt-and-suspenders fallback d.ensureWitnessesRunning() // 3. Witness health (checks tmux directly) d.ensureRefineriesRunning() // 4. Refinery health (checks tmux directly) d.triggerPendingSpawns() // 5. Bootstrap polecats d.processLifecycleRequests() // 6. Cycle/restart requests // Agent state derived from tmux, not recorded in beads (gt-zecmc)}Deacon Heartbeat (continuous)
Section titled “Deacon Heartbeat (continuous)”The Deacon updates ~/gt/deacon/heartbeat.json at the start of each patrol cycle:
{ "timestamp": "2026-01-02T18:30:00Z", "cycle": 42, "last_action": "health-scan", "healthy_agents": 3, "unhealthy_agents": 0}Heartbeat Freshness
Section titled “Heartbeat Freshness”| Age | State | Boot Action |
|---|---|---|
| < 5 min | Fresh | Nothing (Deacon active) |
| 5-15 min | Stale | Nudge if pending mail |
| > 15 min | Very stale | Wake (Deacon may be stuck) |
Boot Decision Matrix
Section titled “Boot Decision Matrix”When Boot runs, it observes:
- Is Deacon session alive?
- How old is Deacon’s heartbeat?
- Is there pending mail for Deacon?
- What’s in Deacon’s tmux pane?
Then decides:
| Condition | Action | Command |
|---|---|---|
| Session dead | START | Exit; daemon calls ensureDeaconRunning() |
| Heartbeat > 15 min | WAKE | gt nudge deacon "Boot wake: check your inbox" |
| Heartbeat 5-15 min + mail | NUDGE | gt nudge deacon "Boot check-in: pending work" |
| Heartbeat fresh | NOTHING | Exit silently |
Handoff Flow
Section titled “Handoff Flow”Deacon Handoff
Section titled “Deacon Handoff”The Deacon runs continuous patrol cycles. After N cycles or high context:
End of patrol cycle: │ ├─ Squash wisp to digest (ephemeral → permanent) ├─ Write summary to molecule state └─ gt handoff -s "Routine cycle" -m "Details" │ └─ Creates mail for next sessionNext daemon tick:
Daemon → ensureDeaconRunning() │ └─ Spawns fresh Deacon in gt-deacon │ └─ SessionStart hook: gt mail check --inject │ └─ Previous handoff mail injected │ └─ Deacon reads and continuesBoot Handoff (Rare)
Section titled “Boot Handoff (Rare)”Boot is ephemeral - it exits after each tick. No persistent handoff needed.
However, Boot uses a marker file to prevent double-spawning:
- Marker:
~/gt/deacon/dogs/boot/.boot-running(TTL: 5 minutes) - Status:
~/gt/deacon/dogs/boot/.boot-status.json(last action/result)
If the marker exists and is recent, daemon skips Boot spawn for that tick.
Degraded Mode
Section titled “Degraded Mode”When tmux is unavailable, Gas Town enters degraded mode:
| Capability | Normal | Degraded |
|---|---|---|
| Boot runs | As AI in tmux | As Go code (mechanical) |
| Observe panes | Yes | No |
| Nudge agents | Yes | No |
| Start agents | tmux sessions | Direct spawn |
Degraded Boot triage is purely mechanical:
- Session dead → start
- Heartbeat stale → restart
- No reasoning, just thresholds
Fallback Chain
Section titled “Fallback Chain”Multiple layers ensure recovery:
- Boot triage - Intelligent observation, first line
- Daemon checkDeaconHeartbeat() - Belt-and-suspenders if Boot fails
- Tmux-based discovery - Daemon checks tmux sessions directly (no bead state)
- Human escalation - Mail to overseer for unrecoverable states
State Files
Section titled “State Files”| File | Purpose | Updated By |
|---|---|---|
deacon/heartbeat.json | Deacon freshness | Deacon (each cycle) |
deacon/dogs/boot/.boot-running | Boot in-progress marker | Boot spawn |
deacon/dogs/boot/.boot-status.json | Boot last action | Boot triage |
deacon/health-check-state.json | Agent health tracking | gt deacon health-check |
daemon/daemon.log | Daemon activity | Daemon |
daemon/daemon.pid | Daemon process ID | Daemon startup |
Debugging
Section titled “Debugging”# Check Deacon heartbeatcat ~/gt/deacon/heartbeat.json | jq .
# Check Boot statuscat ~/gt/deacon/dogs/boot/.boot-status.json | jq .
# View daemon logtail -f ~/gt/daemon/daemon.log
# Manual Boot rungt boot triage
# Manual Deacon health checkgt deacon health-checkCommon Issues
Section titled “Common Issues”Boot Spawns in Wrong Session
Section titled “Boot Spawns in Wrong Session”Symptom: Boot runs in hq-deacon instead of gt-boot
Cause: Session name confusion in spawn code
Fix: Ensure gt boot triage specifies --session=gt-boot
Zombie Sessions Block Restart
Section titled “Zombie Sessions Block Restart”Symptom: tmux session exists but Claude is dead
Cause: Daemon checks session existence, not process health
Fix: Kill zombie sessions before recreating: gt session kill hq-deacon
Status Shows Wrong State
Section titled “Status Shows Wrong State”Symptom: gt status shows wrong state for agents
Cause: Previously bead state and tmux state could diverge
Fix: As of gt-zecmc, status derives state from tmux directly (no bead state for
observable conditions like running/stopped). Non-observable states (stuck, awaiting-gate)
are still stored in beads.
Design Decision: Keep Separation
Section titled “Design Decision: Keep Separation”The issue [gt-1847v] considered three options:
Option A: Keep Boot/Deacon Separation (CHOSEN)
Section titled “Option A: Keep Boot/Deacon Separation (CHOSEN)”- Boot is ephemeral, spawns fresh each heartbeat
- Boot runs in
gt-boot, exits after triage - Deacon runs in
hq-deacon, continuous patrol - Clear session boundaries, clear lifecycle
Verdict: This is the correct design. The implementation needs fixing, not the architecture.
Option B: Merge Boot into Deacon (Rejected)
Section titled “Option B: Merge Boot into Deacon (Rejected)”- Single
hq-deaconsession handles everything - Deacon checks “should I be awake?” internally
Why rejected:
- Deacon can’t observe itself (hung Deacon can’t detect hang)
- Context accumulates even when idle (cost in quiet towns)
- No external watchdog means no recovery from Deacon failure
Option C: Replace with Go Watchdog (Rejected)
Section titled “Option C: Replace with Go Watchdog (Rejected)”- Daemon directly monitors witness/refinery
- No Boot, no Deacon AI for health checks
- AI agents only for complex decisions
Why rejected:
- Go code can’t interpret tmux pane output semantically
- Can’t distinguish “stuck” from “thinking deeply”
- Loses the intelligent triage that makes the system resilient
- Escalation decisions are nuanced (when to notify? force-restart?)
Implementation Fixes Needed
Section titled “Implementation Fixes Needed”The separation is correct; these bugs need fixing:
- Session confusion (gt-sgzsb): Boot spawns in wrong session
- Zombie blocking (gt-j1i0r): Daemon can’t kill zombie sessions
Status mismatch (gt-doih4): Bead vs tmux state divergence→ FIXED in gt-zecmc- Ensure semantics (gt-ekc5u): Start should kill zombies first
Summary
Section titled “Summary”The watchdog chain provides autonomous recovery:
- Daemon: Mechanical heartbeat, spawns Boot
- Boot: Intelligent triage, decides Deacon fate
- Deacon: Continuous patrol, monitors workers
Boot exists because the daemon can’t reason and Deacon can’t observe itself. The separation costs complexity but enables:
- Intelligent triage without constant AI cost
- Fresh context for each triage decision
- Graceful degradation when tmux unavailable
- Multiple fallback layers for reliability
