Dog Pool Architecture for Concurrent Shutdown Dances
Design document for gt-fsld8
Problem Statement
Boot needs to run multiple shutdown-dance molecules concurrently when multiple death warrants are issued. The current hook design only allows one molecule per agent.
Example scenario:
- Warrant 1: Kill stuck polecat Toast (60s into interrogation)
- Warrant 2: Kill stuck polecat Shadow (just started)
- Warrant 3: Kill stuck witness (120s into interrogation)
All three need concurrent tracking, independent timeouts, and separate outcomes.
Design Decision: Lightweight State Machines
After analyzing the options, the shutdown-dance does NOT need Claude sessions. The dance is a deterministic state machine:

    WARRANT -> INTERROGATE -> EVALUATE -> PARDON|EXECUTE

Each step is mechanical:
- Send a tmux message (no LLM needed)
- Wait for timeout or response (timer)
- Check tmux output for ALIVE keyword (string match)
- Repeat or terminate
Decision: Dogs are lightweight goroutines, not Claude sessions.
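The deterministic nature of the dance can be illustrated with a pure transition function. This is a minimal sketch, not the actual implementation: the names `danceState`, `next`, and `maxAttempts` are hypothetical, and the limit of three attempts is taken from the timeout gates described later in this document.

```go
package main

import "fmt"

// danceState is a hypothetical type naming the steps of the dance.
type danceState string

const (
	stWarrant     danceState = "WARRANT"
	stInterrogate danceState = "INTERROGATE"
	stEvaluate    danceState = "EVALUATE"
	stPardon      danceState = "PARDON"
	stExecute     danceState = "EXECUTE"
)

// maxAttempts is an assumption based on the three timeout gates.
const maxAttempts = 3

// next returns the state that follows cur. alive reports whether the ALIVE
// keyword was seen; attempt is the 1-based attempt that was just evaluated.
func next(cur danceState, alive bool, attempt int) danceState {
	switch cur {
	case stWarrant:
		return stInterrogate
	case stInterrogate:
		return stEvaluate
	case stEvaluate:
		if alive {
			return stPardon
		}
		if attempt < maxAttempts {
			return stInterrogate // retry with the next attempt
		}
		return stExecute
	default:
		return cur // PARDON and EXECUTE are terminal
	}
}

func main() {
	fmt.Println(next(stEvaluate, false, 1))
	fmt.Println(next(stEvaluate, false, 3))
	fmt.Println(next(stEvaluate, true, 2))
}
```

Because the function has no side effects and needs no model output, every transition can be unit tested without tmux or a Claude session.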
Architecture Overview

    ┌────────────────────────────────────────────────────────────────────┐
    │                                BOOT                                │
    │                      (Claude session in tmux)                      │
    │                                                                    │
    │  ┌──────────────────────────────────────────────────────────────┐  │
    │  │                         Dog Manager                          │  │
    │  │                                                              │  │
    │  │  Pool: [Dog1, Dog2, Dog3, ...]  (goroutines + state files)   │  │
    │  │                                                              │  │
    │  │  allocate() → Dog                                            │  │
    │  │  release(Dog)                                                │  │
    │  │  status() → []DogStatus                                      │  │
    │  └──────────────────────────────────────────────────────────────┘  │
    │                                                                    │
    │  Boot's job:                                                       │
    │  - Watch for warrants (file or event)                              │
    │  - Allocate dog from pool                                          │
    │  - Monitor dog progress                                            │
    │  - Handle dog completion/failure                                   │
    │  - Report results                                                  │
    └────────────────────────────────────────────────────────────────────┘

Dog Structure

    // Dog represents a shutdown-dance executor
    type Dog struct {
        ID        string             // Unique ID (e.g., "dog-1704567890123")
        Warrant   *Warrant           // The death warrant being processed
        State     ShutdownDanceState
        Attempt   int                // Current interrogation attempt (1-3)
        StartedAt time.Time
        StateFile string             // Persistent state: ~/gt/deacon/dogs/active/<id>.json
    }

    type ShutdownDanceState string

    const (
        StateIdle          ShutdownDanceState = "idle"
        StateInterrogating ShutdownDanceState = "interrogating" // Sent message, waiting
        StateEvaluating    ShutdownDanceState = "evaluating"    // Checking response
        StatePardoned      ShutdownDanceState = "pardoned"      // Session responded
        StateExecuting     ShutdownDanceState = "executing"     // Killing session
        StateComplete      ShutdownDanceState = "complete"      // Done, ready for cleanup
        StateFailed        ShutdownDanceState = "failed"        // Dog crashed/errored
    )

    type Warrant struct {
        ID        string    // Bead ID for the warrant
        Target    string    // Session to interrogate (e.g., "gt-gastown-Toast")
        Reason    string    // Why warrant was issued
        Requester string    // Who filed the warrant
        FiledAt   time.Time
    }

Pool Design
Fixed Pool Size
Decision: Fixed pool of 5 dogs, configurable via environment.
Rationale:
- Dynamic sizing adds complexity without clear benefit
- 5 concurrent shutdown dances handles worst-case scenarios
- If pool exhausted, warrants queue (better than infinite dog spawning)
- Memory footprint is negligible (goroutines + small state files)

    const (
        DefaultPoolSize = 5
        MaxPoolSize     = 20
    )

    type DogPool struct {
        mu       sync.Mutex
        dogs     []*Dog          // All dogs in pool
        idle     chan *Dog       // Channel of available dogs
        active   map[string]*Dog // ID -> Dog for active dogs
        stateDir string          // ~/gt/deacon/dogs/active/
    }

    func (p *DogPool) Allocate(warrant *Warrant) (*Dog, error) {
        select {
        case dog := <-p.idle:
            dog.Warrant = warrant
            dog.State = StateInterrogating
            dog.Attempt = 1
            dog.StartedAt = time.Now()

            // The active map is shared with Release, so guard the write.
            p.mu.Lock()
            p.active[dog.ID] = dog
            p.mu.Unlock()
            return dog, nil
        default:
            return nil, ErrPoolExhausted
        }
    }

    func (p *DogPool) Release(dog *Dog) {
        p.mu.Lock()
        defer p.mu.Unlock()
        delete(p.active, dog.ID)
        dog.Reset()
        // Assumes idle is buffered to the pool size, so this send never blocks.
        p.idle <- dog
    }

Why Not Dynamic Pool?
Considered but rejected:
- Adding dogs on demand increases complexity
- No clear benefit - warrants rarely exceed 5 concurrent
- If needed, raise DefaultPoolSize
- Simpler to reason about fixed resources
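Since the pool size is fixed but configurable through the environment, reading it could look like the sketch below. It assumes the `GT_DOG_POOL_SIZE` variable named in the summary table; the helper `poolSizeFrom` is hypothetical, and falling back to the default on invalid input and clamping to `MaxPoolSize` are assumed behaviors, not decisions recorded in this document.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const (
	DefaultPoolSize = 5
	MaxPoolSize     = 20
)

// poolSizeFrom parses the raw value of GT_DOG_POOL_SIZE.
// Empty, non-numeric, or non-positive values fall back to DefaultPoolSize;
// values above MaxPoolSize are clamped (both are assumptions).
func poolSizeFrom(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return DefaultPoolSize
	}
	if n > MaxPoolSize {
		return MaxPoolSize
	}
	return n
}

func main() {
	size := poolSizeFrom(os.Getenv("GT_DOG_POOL_SIZE"))
	fmt.Println("dog pool size:", size)
}
```

Keeping the parsing separate from the environment lookup makes the clamping rules testable without touching process state.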
Communication: State Files + Events
State Persistence
Each active dog writes state to ~/gt/deacon/dogs/active/<id>.json:

    {
      "id": "dog-1704567890123",
      "warrant": {
        "id": "gt-abc123",
        "target": "gt-gastown-Toast",
        "reason": "no_response_health_check",
        "requester": "deacon",
        "filed_at": "2026-01-07T20:15:00Z"
      },
      "state": "interrogating",
      "attempt": 2,
      "started_at": "2026-01-07T20:15:00Z",
      "last_message_at": "2026-01-07T20:16:00Z",
      "next_timeout": "2026-01-07T20:18:00Z"
    }

Boot Monitoring
Boot monitors dogs via:
- Polling: gt dog status --active every tick
- Completion files: Dogs write <id>.done when complete

    type DogResult struct {
        DogID    string
        Warrant  *Warrant
        Outcome  DogOutcome // pardoned | executed | failed
        Duration time.Duration
        Details  string
    }

    type DogOutcome string

    const (
        OutcomePardoned DogOutcome = "pardoned" // Session responded
        OutcomeExecuted DogOutcome = "executed" // Session killed
        OutcomeFailed   DogOutcome = "failed"   // Dog crashed
    )

Why Not Mail?
Considered but rejected for dog<->boot communication:
- Mail is async, poll-based - adds latency
- State files are simpler for local coordination
- Dogs don’t need complex inter-agent communication
- Keep mail for external coordination (Witness, Mayor)
Shutdown Dance State Machine
Each dog executes this state machine:

                    ┌─────────────────────────────────────────┐
                    │                                         │
                    ▼                                         │
        ┌───────────────────────────┐                         │
        │       INTERROGATING       │                         │
        │                           │                         │
        │  1. Send health check     │                         │
        │  2. Start timeout timer   │                         │
        └───────────┬───────────────┘                         │
                    │                                         │
                    │ timeout or response                     │
                    ▼                                         │
        ┌───────────────────────────┐                         │
        │        EVALUATING         │                         │
        │                           │                         │
        │  Check tmux output for    │                         │
        │  ALIVE keyword            │                         │
        └───────────┬───────────────┘                         │
                    │                                         │
            ┌───────┴───────┐                                 │
            │               │                                 │
            ▼               ▼                                 │
      [ALIVE found]    [No ALIVE]                             │
            │               │                                 │
            │               │ attempt < 3?                    │
            │               ├───────────────────────────────→─┘
            │               │ yes: attempt++, longer timeout
            │               │
            │               │ no: attempt == 3
            ▼               ▼
       ┌─────────┐   ┌─────────────┐
       │ PARDONED│   │  EXECUTING  │
       │         │   │             │
       │ Cancel  │   │ Kill tmux   │
       │ warrant │   │ session     │
       └────┬────┘   └──────┬──────┘
            │               │
            └───────┬───────┘
                    │
                    ▼
            ┌───────────────┐
            │   COMPLETE    │
            │               │
            │ Write result  │
            │ Release dog   │
            └───────────────┘

Timeout Gates
| Attempt | Timeout | Cumulative Wait |
|---|---|---|
| 1 | 60s | 60s |
| 2 | 120s | 180s (3 min) |
| 3 | 240s | 420s (7 min) |
Health Check Message

    [DOG] HEALTH CHECK: Session {target}, respond ALIVE within {timeout}s or face termination.
    Warrant reason: {reason}
    Filed by: {requester}
    Attempt: {attempt}/3

Response Detection
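
    func (d *Dog) CheckForResponse() bool {
        tm := tmux.NewTmux()
        output, err := tm.CapturePane(d.Warrant.Target, 50) // Last 50 lines
        if err != nil {
            return false
        }

        // The health check message itself contains the word ALIVE, so only
        // inspect output that follows the final line of the latest check.
        marker := fmt.Sprintf("Attempt: %d/3", d.Attempt)
        if i := strings.LastIndex(output, marker); i >= 0 {
            output = output[i+len(marker):]
        }

        // Look for the ALIVE keyword as the explicit response
        return strings.Contains(output, "ALIVE")
    }

Dog Implementation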
Not Reusing Polecat Infrastructure
Decision: Dogs do NOT reuse polecat infrastructure.
Rationale:
- Polecats are Claude sessions with molecules, hooks, sandboxes
- Dogs are simple state machine executors
- Polecats have 3-layer lifecycle (session/sandbox/slot)
- Dogs have single-layer lifecycle (just state)
- Different resource profiles, different management
What dogs DO share:
- tmux utilities for message sending/capture
- State file patterns
- Name slot allocation pattern (pool of names, not instances)
Dog Execution Loop

    func (d *Dog) Run(ctx context.Context) DogResult {
        d.State = StateInterrogating
        d.saveState()

        for d.Attempt <= 3 {
            // Send interrogation message
            if err := d.sendHealthCheck(); err != nil {
                return d.fail(err)
            }

            // Wait for timeout or context cancellation
            timeout := d.timeoutForAttempt(d.Attempt)
            select {
            case <-ctx.Done():
                return d.fail(ctx.Err())
            case <-time.After(timeout):
                // Timeout reached
            }

            // Evaluate response
            d.State = StateEvaluating
            d.saveState()

            if d.CheckForResponse() {
                // Session is alive
                return d.pardon()
            }

            // No response - try again or execute
            d.Attempt++
            if d.Attempt <= 3 {
                d.State = StateInterrogating
                d.saveState()
            }
        }

        // All attempts exhausted - execute warrant
        return d.execute()
    }

Failure Handling
Dog Crashes Mid-Dance
If a dog crashes (Boot process restarts, system crash):
- State files persist in ~/gt/deacon/dogs/active/
- On Boot restart, scan for orphaned state files
- Resume or restart based on state:
| State | Recovery Action |
|---|---|
| interrogating | Restart from current attempt |
| evaluating | Check response, continue |
| executing | Verify kill, mark complete |
| pardoned/complete | Already done, clean up |
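The recovery table can be expressed as a lookup that Boot consults for each orphaned state file. This is a minimal sketch with hypothetical names (`recoveryAction`, `recoveryFor`). Treating states that the table does not list, such as idle or failed, as an error is an assumption.

```go
package main

import "fmt"

// recoveryAction is a hypothetical type describing what Boot does with
// an orphaned state file.
type recoveryAction string

const (
	actRestartAttempt recoveryAction = "restart from current attempt"
	actCheckResponse  recoveryAction = "check response, continue"
	actVerifyKill     recoveryAction = "verify kill, mark complete"
	actCleanUp        recoveryAction = "already done, clean up"
)

// recoveryFor maps a persisted dance state to its recovery action.
// States missing from the table yield an error (assumed behavior).
func recoveryFor(state string) (recoveryAction, error) {
	switch state {
	case "interrogating":
		return actRestartAttempt, nil
	case "evaluating":
		return actCheckResponse, nil
	case "executing":
		return actVerifyKill, nil
	case "pardoned", "complete":
		return actCleanUp, nil
	default:
		return "", fmt.Errorf("no recovery rule for state %q", state)
	}
}

func main() {
	for _, s := range []string{"interrogating", "evaluating", "executing", "complete", "idle"} {
		action, err := recoveryFor(s)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", s, action)
	}
}
```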

    func (p *DogPool) RecoverOrphans() error {
        files, err := filepath.Glob(filepath.Join(p.stateDir, "*.json"))
        if err != nil {
            return err
        }
        for _, f := range files {
            state := loadDogState(f)
            // Pardoned and complete dances are already done and only need cleanup.
            if state.State != StateComplete && state.State != StatePardoned {
                dog := p.allocateForRecovery(state)
                go dog.Resume()
            }
        }
        return nil
    }

Handling Pool Exhaustion
If all dogs are busy when a new warrant arrives:

    func (b *Boot) HandleWarrant(warrant *Warrant) error {
        dog, err := b.pool.Allocate(warrant)
        if errors.Is(err, ErrPoolExhausted) {
            // Queue the warrant for later processing
            b.warrantQueue.Push(warrant)
            b.log("Warrant %s queued (pool exhausted)", warrant.ID)
            return nil
        }
        if err != nil {
            return err
        }

        go func() {
            result := dog.Run(b.ctx)
            b.handleResult(result)
            b.pool.Release(dog)

            // Check queue for pending warrants
            if next := b.warrantQueue.Pop(); next != nil {
                if err := b.HandleWarrant(next); err != nil {
                    b.log("Warrant %s failed to start: %v", next.ID, err)
                }
            }
        }()

        return nil
    }

Directory Structure

    ~/gt/deacon/dogs/
    ├── boot/                    # Boot's working directory
    │   ├── CLAUDE.md            # Boot context
    │   └── .boot-status.json    # Boot execution status
    ├── active/                  # Active dog state files
    │   ├── dog-123.json         # Dog 1 state
    │   ├── dog-456.json         # Dog 2 state
    │   └── ...
    ├── completed/               # Completed dance records (for audit)
    │   ├── dog-789.json         # Historical record
    │   └── ...
    └── warrants/                # Pending warrant queue
        ├── warrant-abc.json
        └── ...

Command Interface

    # Pool status
    gt dog pool status
    # Output:
    # Dog Pool: 3/5 active
    #   dog-123: interrogating Toast (attempt 2, 45s remaining)
    #   dog-456: executing Shadow
    #   dog-789: idle

    # Manual dog operations (for debugging)
    gt dog pool allocate <warrant-id>
    gt dog pool release <dog-id>

    # View active dances
    gt dog dances
    # Output:
    # Active Shutdown Dances:
    #   dog-123 → Toast: Interrogating (2/3), timeout in 45s
    #   dog-456 → Shadow: Executing warrant

    # View warrant queue
    gt dog warrants
    # Output:
    # Pending Warrants: 2
    #   1. gt-abc: witness-gastown (stuck_no_progress)
    #   2. gt-def: polecat-Copper (crash_loop)

Integration with Existing Dogs
The existing dog package (internal/dog/) manages Deacon’s multi-rig helper dogs.
Those are different from shutdown-dance dogs:
| Aspect | Helper Dogs (existing) | Dance Dogs (new) |
|---|---|---|
| Purpose | Cross-rig infrastructure | Shutdown dance execution |
| Sessions | Claude sessions | Goroutines (no Claude) |
| Worktrees | One per rig | None |
| Lifecycle | Long-lived, reusable | Ephemeral per warrant |
| State | idle/working | Dance state machine |
Recommendation: Use a different package to avoid confusion:
- internal/dog/ - existing helper dogs
- internal/shutdown/ - shutdown dance pool
Summary: Answers to Design Questions
| Question | Answer |
|---|---|
| How many Dogs in pool? | Fixed: 5 (configurable via GT_DOG_POOL_SIZE) |
| How do Dogs communicate with Boot? | State files + completion markers |
| Are Dogs tmux sessions? | No - goroutines with state machine |
| Reuse polecat infrastructure? | No - too heavyweight, different model |
| What if Dog dies mid-dance? | State file recovery on Boot restart |
Acceptance Criteria
- Architecture document for Dog pool
- Clear allocation/deallocation protocol
- Failure handling for Dog crashes