Skip to content

Daemon/Boot/Deacon Watchdog Chain

Autonomous health monitoring and recovery in Gas Town.

Gas Town uses a three-tier watchdog chain for autonomous health monitoring:

Daemon (Go process) ← Dumb transport, 3-min heartbeat
└─► Boot (AI agent) ← Intelligent triage, fresh each tick
└─► Deacon (AI agent) ← Continuous patrol, long-running
└─► Witnesses & Refineries ← Per-rig agents

Key insight: The daemon is mechanical (can’t reason), but health decisions need intelligence (is the agent stuck or just thinking?). Boot bridges this gap.

The daemon needs to ensure the Deacon is healthy, but:

  1. Daemon can’t reason - It’s Go code following the ZFC principle (don’t reason about other agents). It can check “is session alive?” but not “is agent stuck?”

  2. Waking costs context - Each time you spawn an AI agent, you consume context tokens. In idle towns, waking Deacon every 3 minutes wastes resources.

  3. Observation requires intelligence - Distinguishing “agent composing large artifact” from “agent hung on tool prompt” requires reasoning.

Boot is a narrow, ephemeral AI agent that:

  • Runs fresh each daemon tick (no accumulated context debt)
  • Makes a single decision: should Deacon wake?
  • Exits immediately after deciding

This gives us intelligent triage without the cost of keeping a full AI running.

We could have Deacon handle its own “should I be awake?” logic, but:

  1. Deacon can’t observe itself - A hung Deacon can’t detect it’s hung
  2. Context accumulation - Deacon runs continuously; Boot restarts fresh
  3. Cost in idle towns - Boot only costs tokens when it runs; Deacon costs tokens constantly if kept alive

The daemon could directly monitor agents without AI, but:

  1. Can’t observe panes - Go code can’t interpret tmux output semantically
  2. Can’t distinguish stuck vs working - No reasoning about agent state
  3. Escalation is complex - When to notify? When to force-restart? AI handles nuanced decisions better than hardcoded thresholds
AgentSession NameLocationLifecycle
Daemon(Go process)~/gt/daemon/Persistent, auto-restart
Bootgt-boot~/gt/deacon/dogs/boot/Ephemeral, fresh each tick
Deaconhq-deacon~/gt/deacon/Long-running, handoff loop

Critical: Boot runs in gt-boot, NOT hq-deacon. This prevents Boot from conflicting with a running Deacon session.

The daemon runs a heartbeat tick every 3 minutes:

func (d *Daemon) heartbeatTick() {
d.ensureBootRunning() // 1. Spawn Boot for triage
d.checkDeaconHeartbeat() // 2. Belt-and-suspenders fallback
d.ensureWitnessesRunning() // 3. Witness health (checks tmux directly)
d.ensureRefineriesRunning() // 4. Refinery health (checks tmux directly)
d.triggerPendingSpawns() // 5. Bootstrap polecats
d.processLifecycleRequests() // 6. Cycle/restart requests
// Agent state derived from tmux, not recorded in beads (gt-zecmc)
}

The Deacon updates ~/gt/deacon/heartbeat.json at the start of each patrol cycle:

{
"timestamp": "2026-01-02T18:30:00Z",
"cycle": 42,
"last_action": "health-scan",
"healthy_agents": 3,
"unhealthy_agents": 0
}
AgeStateBoot Action
< 5 minFreshNothing (Deacon active)
5-15 minStaleNudge if pending mail
> 15 minVery staleWake (Deacon may be stuck)

When Boot runs, it observes:

  • Is Deacon session alive?
  • How old is Deacon’s heartbeat?
  • Is there pending mail for Deacon?
  • What’s in Deacon’s tmux pane?

Then decides:

ConditionActionCommand
Session deadSTARTExit; daemon calls ensureDeaconRunning()
Heartbeat > 15 minWAKEgt nudge deacon "Boot wake: check your inbox"
Heartbeat 5-15 min + mailNUDGEgt nudge deacon "Boot check-in: pending work"
Heartbeat freshNOTHINGExit silently

The Deacon runs continuous patrol cycles. After N cycles or high context:

End of patrol cycle:
├─ Squash wisp to digest (ephemeral → permanent)
├─ Write summary to molecule state
└─ gt handoff -s "Routine cycle" -m "Details"
└─ Creates mail for next session

Next daemon tick:

Daemon → ensureDeaconRunning()
└─ Spawns fresh Deacon in gt-deacon
└─ SessionStart hook: gt mail check --inject
└─ Previous handoff mail injected
└─ Deacon reads and continues

Boot is ephemeral - it exits after each tick. No persistent handoff needed.

However, Boot uses a marker file to prevent double-spawning:

  • Marker: ~/gt/deacon/dogs/boot/.boot-running (TTL: 5 minutes)
  • Status: ~/gt/deacon/dogs/boot/.boot-status.json (last action/result)

If the marker exists and is recent, daemon skips Boot spawn for that tick.

When tmux is unavailable, Gas Town enters degraded mode:

CapabilityNormalDegraded
Boot runsAs AI in tmuxAs Go code (mechanical)
Observe panesYesNo
Nudge agentsYesNo
Start agentstmux sessionsDirect spawn

Degraded Boot triage is purely mechanical:

  • Session dead → start
  • Heartbeat stale → restart
  • No reasoning, just thresholds

Multiple layers ensure recovery:

  1. Boot triage - Intelligent observation, first line
  2. Daemon checkDeaconHeartbeat() - Belt-and-suspenders if Boot fails
  3. Tmux-based discovery - Daemon checks tmux sessions directly (no bead state)
  4. Human escalation - Mail to overseer for unrecoverable states
FilePurposeUpdated By
deacon/heartbeat.jsonDeacon freshnessDeacon (each cycle)
deacon/dogs/boot/.boot-runningBoot in-progress markerBoot spawn
deacon/dogs/boot/.boot-status.jsonBoot last actionBoot triage
deacon/health-check-state.jsonAgent health trackinggt deacon health-check
daemon/daemon.logDaemon activityDaemon
daemon/daemon.pidDaemon process IDDaemon startup
Terminal window
# Check Deacon heartbeat
cat ~/gt/deacon/heartbeat.json | jq .
# Check Boot status
cat ~/gt/deacon/dogs/boot/.boot-status.json | jq .
# View daemon log
tail -f ~/gt/daemon/daemon.log
# Manual Boot run
gt boot triage
# Manual Deacon health check
gt deacon health-check

Symptom: Boot runs in hq-deacon instead of gt-boot Cause: Session name confusion in spawn code Fix: Ensure gt boot triage specifies --session=gt-boot

Symptom: tmux session exists but Claude is dead Cause: Daemon checks session existence, not process health Fix: Kill zombie sessions before recreating: gt session kill hq-deacon

Symptom: gt status shows wrong state for agents Cause: Previously bead state and tmux state could diverge Fix: As of gt-zecmc, status derives state from tmux directly (no bead state for observable conditions like running/stopped). Non-observable states (stuck, awaiting-gate) are still stored in beads.

The issue [gt-1847v] considered three options:

Option A: Keep Boot/Deacon Separation (CHOSEN)

Section titled “Option A: Keep Boot/Deacon Separation (CHOSEN)”
  • Boot is ephemeral, spawns fresh each heartbeat
  • Boot runs in gt-boot, exits after triage
  • Deacon runs in hq-deacon, continuous patrol
  • Clear session boundaries, clear lifecycle

Verdict: This is the correct design. The implementation needs fixing, not the architecture.

Option B: Merge Boot into Deacon (Rejected)

Section titled “Option B: Merge Boot into Deacon (Rejected)”
  • Single hq-deacon session handles everything
  • Deacon checks “should I be awake?” internally

Why rejected:

  • Deacon can’t observe itself (hung Deacon can’t detect hang)
  • Context accumulates even when idle (cost in quiet towns)
  • No external watchdog means no recovery from Deacon failure

Option C: Replace with Go Watchdog (Rejected)

Section titled “Option C: Replace with Go Watchdog (Rejected)”
  • Daemon directly monitors witness/refinery
  • No Boot, no Deacon AI for health checks
  • AI agents only for complex decisions

Why rejected:

  • Go code can’t interpret tmux pane output semantically
  • Can’t distinguish “stuck” from “thinking deeply”
  • Loses the intelligent triage that makes the system resilient
  • Escalation decisions are nuanced (when to notify? force-restart?)

The separation is correct; these bugs need fixing:

  1. Session confusion (gt-sgzsb): Boot spawns in wrong session
  2. Zombie blocking (gt-j1i0r): Daemon can’t kill zombie sessions
  3. Status mismatch (gt-doih4): Bead vs tmux state divergence → FIXED in gt-zecmc
  4. Ensure semantics (gt-ekc5u): Start should kill zombies first

The watchdog chain provides autonomous recovery:

  • Daemon: Mechanical heartbeat, spawns Boot
  • Boot: Intelligent triage, decides Deacon fate
  • Deacon: Continuous patrol, monitors workers

Boot exists because the daemon can’t reason and Deacon can’t observe itself. The separation costs complexity but enables:

  1. Intelligent triage without constant AI cost
  2. Fresh context for each triage decision
  3. Graceful degradation when tmux unavailable
  4. Multiple fallback layers for reliability