
Dog Pool Architecture for Concurrent Shutdown Dances

Design document for gt-fsld8

Problem Statement

Boot needs to run multiple shutdown-dance molecules concurrently when multiple death warrants are issued. The current hook design only allows one molecule per agent.

Example scenario:

  • Warrant 1: Kill stuck polecat Toast (60s into interrogation)
  • Warrant 2: Kill stuck polecat Shadow (just started)
  • Warrant 3: Kill stuck witness (120s into interrogation)

All three need concurrent tracking, independent timeouts, and separate outcomes.

Design Decision: Lightweight State Machines

Analysis of the options shows that the shutdown-dance does NOT need Claude sessions. The dance is a deterministic state machine:

WARRANT -> INTERROGATE -> EVALUATE -> PARDON|EXECUTE

Each step is mechanical:

  1. Send a tmux message (no LLM needed)
  2. Wait for timeout or response (timer)
  3. Check tmux output for ALIVE keyword (string match)
  4. Repeat or terminate

Decision: Dogs are lightweight goroutines, not Claude sessions.
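
Because every step is mechanical, the transitions can be written as a pure function with no I/O. The following is a minimal sketch rather than the project's code: the names State, Next, and maxAttempts are hypothetical, and the three-attempt limit is taken from the interrogation attempts described later in this document.

```go
package main

import "fmt"

// State names mirror the WARRANT -> INTERROGATE -> EVALUATE -> PARDON|EXECUTE flow.
type State string

const (
	Interrogate State = "interrogate"
	Evaluate    State = "evaluate"
	Pardon      State = "pardon"
	Execute     State = "execute"
)

const maxAttempts = 3 // assumption: three interrogation attempts per warrant

// Next returns the state that follows s. alive reports whether the ALIVE
// keyword was seen; attempt is the 1-based attempt that just finished.
func Next(s State, alive bool, attempt int) State {
	switch s {
	case Interrogate:
		return Evaluate // the timer fired or a response arrived
	case Evaluate:
		if alive {
			return Pardon
		}
		if attempt < maxAttempts {
			return Interrogate // retry with a longer timeout
		}
		return Execute
	default:
		return s // Pardon and Execute are terminal
	}
}

func main() {
	fmt.Println(Next(Evaluate, false, 1))
	fmt.Println(Next(Evaluate, false, 3))
	fmt.Println(Next(Evaluate, true, 2))
}
```

Keeping the transition logic free of tmux calls and timers makes it testable without a live session.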

Architecture Overview

┌────────────────────────────────────────────────────────────────────┐
│                                BOOT                                │
│                      (Claude session in tmux)                      │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                         Dog Manager                          │  │
│  │                                                              │  │
│  │  Pool: [Dog1, Dog2, Dog3, ...]  (goroutines + state files)   │  │
│  │                                                              │  │
│  │  allocate() → Dog                                            │  │
│  │  release(Dog)                                                │  │
│  │  status() → []DogStatus                                      │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  Boot's job:                                                       │
│  - Watch for warrants (file or event)                              │
│  - Allocate dog from pool                                          │
│  - Monitor dog progress                                            │
│  - Handle dog completion/failure                                   │
│  - Report results                                                  │
└────────────────────────────────────────────────────────────────────┘

Dog Structure

// Dog represents a shutdown-dance executor
type Dog struct {
	ID        string   // Unique ID (e.g., "dog-1704567890123")
	Warrant   *Warrant // The death warrant being processed
	State     ShutdownDanceState
	Attempt   int // Current interrogation attempt (1-3)
	StartedAt time.Time
	StateFile string // Persistent state: ~/gt/deacon/dogs/active/<id>.json
}

type ShutdownDanceState string

const (
	StateIdle          ShutdownDanceState = "idle"
	StateInterrogating ShutdownDanceState = "interrogating" // Sent message, waiting
	StateEvaluating    ShutdownDanceState = "evaluating"    // Checking response
	StatePardoned      ShutdownDanceState = "pardoned"      // Session responded
	StateExecuting     ShutdownDanceState = "executing"     // Killing session
	StateComplete      ShutdownDanceState = "complete"      // Done, ready for cleanup
	StateFailed        ShutdownDanceState = "failed"        // Dog crashed/errored
)

type Warrant struct {
	ID        string // Bead ID for the warrant
	Target    string // Session to interrogate (e.g., "gt-gastown-Toast")
	Reason    string // Why warrant was issued
	Requester string // Who filed the warrant
	FiledAt   time.Time
}

Pool Design

Fixed Pool Size

Decision: Fixed pool of 5 dogs, configurable via the GT_DOG_POOL_SIZE environment variable.

Rationale:

  • Dynamic sizing adds complexity without clear benefit
  • 5 concurrent shutdown dances handles worst-case scenarios
  • If pool exhausted, warrants queue (better than infinite dog spawning)
  • Memory footprint is negligible (goroutines + small state files)
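
const (
	DefaultPoolSize = 5
	MaxPoolSize     = 20
)

type DogPool struct {
	mu       sync.Mutex
	dogs     []*Dog          // All dogs in pool
	idle     chan *Dog       // Channel of available dogs
	active   map[string]*Dog // ID -> Dog for active dogs
	stateDir string          // ~/gt/deacon/dogs/active/
}

// Allocate takes an idle dog without blocking and assigns it the warrant.
func (p *DogPool) Allocate(warrant *Warrant) (*Dog, error) {
	select {
	case dog := <-p.idle:
		dog.Warrant = warrant
		dog.State = StateInterrogating
		dog.Attempt = 1
		dog.StartedAt = time.Now()
		// p.active is shared with Release, so guard the write.
		p.mu.Lock()
		p.active[dog.ID] = dog
		p.mu.Unlock()
		return dog, nil
	default:
		// No idle dog: the caller is expected to queue the warrant.
		return nil, ErrPoolExhausted
	}
}

// Release returns a dog to the pool. The idle channel must be buffered
// with capacity equal to the pool size so this send never blocks.
func (p *DogPool) Release(dog *Dog) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.active, dog.ID)
	dog.Reset()
	p.idle <- dog
}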

Why Not Dynamic Pool?

Considered but rejected:

  • Adding dogs on demand increases complexity
  • No clear benefit - warrants rarely exceed 5 concurrent
  • If needed, raise DefaultPoolSize
  • Simpler to reason about fixed resources

Communication: State Files + Events

State Persistence

Each active dog writes state to ~/gt/deacon/dogs/active/<id>.json:

{
  "id": "dog-1704567890123",
  "warrant": {
    "id": "gt-abc123",
    "target": "gt-gastown-Toast",
    "reason": "no_response_health_check",
    "requester": "deacon",
    "filed_at": "2026-01-07T20:15:00Z"
  },
  "state": "interrogating",
  "attempt": 2,
  "started_at": "2026-01-07T20:15:00Z",
  "last_message_at": "2026-01-07T20:16:00Z",
  "next_timeout": "2026-01-07T20:18:00Z"
}

Boot Monitoring

Boot monitors dogs via:

  1. Polling: gt dog status --active every tick
  2. Completion files: Dogs write <id>.done when complete

type DogResult struct {
	DogID    string
	Warrant  *Warrant
	Outcome  DogOutcome // pardoned | executed | failed
	Duration time.Duration
	Details  string
}

type DogOutcome string

const (
	OutcomePardoned DogOutcome = "pardoned" // Session responded
	OutcomeExecuted DogOutcome = "executed" // Session killed
	OutcomeFailed   DogOutcome = "failed"   // Dog crashed
)

Why Not Mail?

Considered but rejected for dog<->boot communication:

  • Mail is async, poll-based - adds latency
  • State files are simpler for local coordination
  • Dogs don’t need complex inter-agent communication
  • Keep mail for external coordination (Witness, Mayor)

Shutdown Dance State Machine

Each dog executes this state machine:

                ┌───────────────────────────────────────────┐
                │                                           │
                ▼                                           │
    ┌───────────────────────────┐                           │
    │       INTERROGATING       │                           │
    │                           │                           │
    │  1. Send health check     │                           │
    │  2. Start timeout timer   │                           │
    └───────────┬───────────────┘                           │
                │                                           │
                │ timeout or response                       │
                ▼                                           │
    ┌───────────────────────────┐                           │
    │        EVALUATING         │                           │
    │                           │                           │
    │  Check tmux output for    │                           │
    │  ALIVE keyword            │                           │
    └───────────┬───────────────┘                           │
                │                                           │
        ┌───────┴───────┐                                   │
        │               │                                   │
        ▼               ▼                                   │
  [ALIVE found]    [No ALIVE]                               │
        │               │                                   │
        │               │ attempt < 3?                      │
        │               ├─────────────────────────────────→─┘
        │               │ yes: attempt++, longer timeout
        │               │
        │               │ no: attempt == 3
        ▼               ▼
   ┌─────────┐   ┌─────────────┐
   │ PARDONED│   │  EXECUTING  │
   │         │   │             │
   │ Cancel  │   │  Kill tmux  │
   │ warrant │   │  session    │
   └────┬────┘   └──────┬──────┘
        │               │
        └───────┬───────┘
                ▼
        ┌───────────────┐
        │   COMPLETE    │
        │               │
        │ Write result  │
        │ Release dog   │
        └───────────────┘

Timeout Gates

| Attempt | Timeout | Cumulative Wait |
|---------|---------|-----------------|
| 1       | 60s     | 60s             |
| 2       | 120s    | 180s (3 min)    |
| 3       | 240s    | 420s (7 min)    |

Health Check Message

[DOG] HEALTH CHECK: Session {target}, respond ALIVE within {timeout}s or face termination.
Warrant reason: {reason}
Filed by: {requester}
Attempt: {attempt}/3

Response Detection

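func (d *Dog) CheckForResponse() bool {
	tm := tmux.NewTmux()
	output, err := tm.CapturePane(d.Warrant.Target, 50) // Last 50 lines
	if err != nil {
		return false
	}
	// The health check itself contains the word ALIVE, so only inspect
	// output that follows the last line of our most recent message.
	marker := fmt.Sprintf("Attempt: %d/3", d.Attempt)
	if i := strings.LastIndex(output, marker); i >= 0 {
		output = output[i+len(marker):]
	}
	// Look for the ALIVE keyword as the explicit response
	return strings.Contains(output, "ALIVE")
}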

Dog Implementation

Not Reusing Polecat Infrastructure

Decision: Dogs do NOT reuse polecat infrastructure.

Rationale:

  • Polecats are Claude sessions with molecules, hooks, sandboxes
  • Dogs are simple state machine executors
  • Polecats have 3-layer lifecycle (session/sandbox/slot)
  • Dogs have single-layer lifecycle (just state)
  • Different resource profiles, different management

What dogs DO share:

  • tmux utilities for message sending/capture
  • State file patterns
  • Name slot allocation pattern (pool of names, not instances)

Dog Execution Loop

func (d *Dog) Run(ctx context.Context) DogResult {
	d.State = StateInterrogating
	d.saveState()

	for d.Attempt <= 3 {
		// Send interrogation message
		if err := d.sendHealthCheck(); err != nil {
			return d.fail(err)
		}

		// Wait for timeout or context cancellation
		timeout := d.timeoutForAttempt(d.Attempt)
		select {
		case <-ctx.Done():
			return d.fail(ctx.Err())
		case <-time.After(timeout):
			// Timeout reached
		}

		// Evaluate response
		d.State = StateEvaluating
		d.saveState()
		if d.CheckForResponse() {
			// Session is alive
			return d.pardon()
		}

		// No response - try again or execute
		d.Attempt++
		if d.Attempt <= 3 {
			d.State = StateInterrogating
			d.saveState()
		}
	}

	// All attempts exhausted - execute warrant
	return d.execute()
}

Failure Handling

Dog Crashes Mid-Dance

If a dog crashes (Boot process restarts, system crash):

  1. State files persist in ~/gt/deacon/dogs/active/
  2. On Boot restart, scan for orphaned state files
  3. Resume or restart based on state:

| State             | Recovery Action              |
|-------------------|------------------------------|
| interrogating     | Restart from current attempt |
| evaluating        | Check response, continue     |
| executing         | Verify kill, mark complete   |
| pardoned/complete | Already done, clean up       |

func (p *DogPool) RecoverOrphans() error {
	files, err := filepath.Glob(filepath.Join(p.stateDir, "*.json"))
	if err != nil {
		return err
	}
	for _, f := range files {
		state := loadDogState(f)
		// Pardoned and complete dances are already done; they only need cleanup.
		if state.State != StateComplete && state.State != StatePardoned {
			dog := p.allocateForRecovery(state)
			go dog.Resume()
		}
	}
	return nil
}
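
The recovery table maps directly onto a switch. The sketch below is illustrative: recoveryAction and recoveryFor are hypothetical names, and returning an error for unlisted states (such as failed or idle) is an assumption, since the table does not cover them.

```go
package main

import "fmt"

type recoveryAction string

const (
	restartAttempt recoveryAction = "restart from current attempt"
	checkResponse  recoveryAction = "check response, continue"
	verifyKill     recoveryAction = "verify kill, mark complete"
	cleanUp        recoveryAction = "already done, clean up"
)

// recoveryFor returns the action for a persisted dance state. States that
// the recovery table does not list are reported as errors.
func recoveryFor(state string) (recoveryAction, error) {
	switch state {
	case "interrogating":
		return restartAttempt, nil
	case "evaluating":
		return checkResponse, nil
	case "executing":
		return verifyKill, nil
	case "pardoned", "complete":
		return cleanUp, nil
	default:
		return "", fmt.Errorf("no recovery rule for state %q", state)
	}
}

func main() {
	for _, s := range []string{"interrogating", "executing", "complete", "failed"} {
		action, err := recoveryFor(s)
		if err != nil {
			fmt.Println(s, "->", err)
			continue
		}
		fmt.Println(s, "->", action)
	}
}
```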

Handling Pool Exhaustion

If all dogs are busy when a new warrant arrives:

func (b *Boot) HandleWarrant(warrant *Warrant) error {
	dog, err := b.pool.Allocate(warrant)
	if errors.Is(err, ErrPoolExhausted) {
		// Queue the warrant for later processing
		b.warrantQueue.Push(warrant)
		b.log("Warrant %s queued (pool exhausted)", warrant.ID)
		return nil
	}
	if err != nil {
		return err // Any other allocation error goes back to the caller
	}

	go func() {
		result := dog.Run(b.ctx)
		b.handleResult(result)
		b.pool.Release(dog)

		// Check queue for pending warrants
		if next := b.warrantQueue.Pop(); next != nil {
			b.HandleWarrant(next)
		}
	}()
	return nil
}
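
The handler above relies on a queue whose Pop returns nil when empty and which is safe to call from several dog goroutines. One possible shape is a mutex-guarded FIFO. This is a sketch: the WarrantQueue type is hypothetical, Warrant is reduced to its ID, and the on-disk warrants/ directory is not modeled.

```go
package main

import (
	"fmt"
	"sync"
)

// Warrant is trimmed to the one field this example needs.
type Warrant struct{ ID string }

// WarrantQueue is a FIFO that is safe for concurrent Push and Pop.
type WarrantQueue struct {
	mu    sync.Mutex
	items []*Warrant
}

// Push appends a warrant to the back of the queue.
func (q *WarrantQueue) Push(w *Warrant) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, w)
}

// Pop removes and returns the oldest warrant, or nil when the queue is empty.
func (q *WarrantQueue) Pop() *Warrant {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return nil
	}
	w := q.items[0]
	q.items[0] = nil // drop the reference so it can be collected
	q.items = q.items[1:]
	return w
}

func main() {
	q := &WarrantQueue{}
	q.Push(&Warrant{ID: "gt-abc"})
	q.Push(&Warrant{ID: "gt-def"})
	for w := q.Pop(); w != nil; w = q.Pop() {
		fmt.Println("processing", w.ID)
	}
}
```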

Directory Structure

~/gt/deacon/dogs/
├── boot/                    # Boot's working directory
│   ├── CLAUDE.md            # Boot context
│   └── .boot-status.json    # Boot execution status
├── active/                  # Active dog state files
│   ├── dog-123.json         # Dog 1 state
│   ├── dog-456.json         # Dog 2 state
│   └── ...
├── completed/               # Completed dance records (for audit)
│   ├── dog-789.json         # Historical record
│   └── ...
└── warrants/                # Pending warrant queue
    ├── warrant-abc.json
    └── ...

Command Interface

# Pool status
gt dog pool status
# Output:
# Dog Pool: 2/5 active
#   dog-123: interrogating Toast (attempt 2, 45s remaining)
#   dog-456: executing Shadow
#   dog-789: idle

# Manual dog operations (for debugging)
gt dog pool allocate <warrant-id>
gt dog pool release <dog-id>

# View active dances
gt dog dances
# Output:
# Active Shutdown Dances:
#   dog-123 → Toast: Interrogating (2/3), timeout in 45s
#   dog-456 → Shadow: Executing warrant

# View warrant queue
gt dog warrants
# Output:
# Pending Warrants: 2
#   1. gt-abc: witness-gastown (stuck_no_progress)
#   2. gt-def: polecat-Copper (crash_loop)

Integration with Existing Dogs

The existing dog package (internal/dog/) manages Deacon’s multi-rig helper dogs. Those are different from shutdown-dance dogs:

| Aspect    | Helper Dogs (existing)    | Dance Dogs (new)         |
|-----------|---------------------------|--------------------------|
| Purpose   | Cross-rig infrastructure  | Shutdown dance execution |
| Sessions  | Claude sessions           | Goroutines (no Claude)   |
| Worktrees | One per rig               | None                     |
| Lifecycle | Long-lived, reusable      | Ephemeral per warrant    |
| State     | idle/working              | Dance state machine      |

Recommendation: Use a different package to avoid confusion:

  • internal/dog/ - existing helper dogs
  • internal/shutdown/ - shutdown dance pool

Summary: Answers to Design Questions

| Question                           | Answer                                         |
|------------------------------------|------------------------------------------------|
| How many Dogs in pool?             | Fixed: 5 (configurable via GT_DOG_POOL_SIZE)   |
| How do Dogs communicate with Boot? | State files + completion markers               |
| Are Dogs tmux sessions?            | No - goroutines with state machine             |
| Reuse polecat infrastructure?      | No - too heavyweight, different model          |
| What if Dog dies mid-dance?        | State file recovery on Boot restart            |

Acceptance Criteria

  • Architecture document for Dog pool
  • Clear allocation/deallocation protocol
  • Failure handling for Dog crashes