overnight-agent headless-mode claude-cli tmux cron process-management rate-limiting circuit-breaker

Headless Claude Agent: Multi-Pass Research Loop and Process Lifecycle Fixes

Headless Claude Agent: Multi-Pass Research and Lifecycle Fixes

Six bugs discovered and fixed during the first real overnight agent run. All relate to running Claude Code in headless mode (claude -p) for long autonomous sessions.

1. SKILL.md Not Loaded in Headless Mode

Symptom: Agent answers the research question in one shot (3 minutes) instead of following the exploration loop.

Root cause: claude -p "$TASK" passes a one-shot prompt. The skill system (SKILL.md with YAML frontmatter) is not activated in headless mode. The agent never sees the behavioral instructions.

Fix: Strip YAML frontmatter from SKILL.md with awk and embed the body into the workspace CLAUDE.md, which Claude Code auto-reads on startup.

# Extract skill body (strip YAML frontmatter)
SKILL_BODY=$(awk 'BEGIN{fm=0} /^---$/{fm++; next} fm>=2{print}' "$SKILL_FILE")

# Append to workspace CLAUDE.md
cat > "$WORKSPACE/CLAUDE.md" << CLAUDEEOF
# Task
$TASK
...
---
$SKILL_BODY
CLAUDEEOF

The run.sh prompt changed from the bare task to:

claude -p "You are an overnight research agent. Read CLAUDE.md in this directory for your task, behavioral instructions, end time, and branch budget. Follow the exploration loop exactly as described. Start by reading CLAUDE.md now."

Gotcha: BSD sed (macOS) can’t strip YAML frontmatter reliably — use awk instead.

2. Agent Rushing to Synthesis

Symptom: 6-hour session completes in 10 minutes with shallow findings.

Root cause: Without explicit pacing, Claude optimizes for completion. It treats research as a task to finish, not a time budget to fill.

Fix: Three mechanisms in SKILL.md:

a) Multi-pass loop — Pass 1 surveys broadly, Pass 2 researches its own logged questions, Pass 3+ deepens with primary sources and contradicting evidence. Each pass must complete before synthesis.

b) Hard time gates:

- Before 75% of time used: Do NOT enter synthesis mode.
- 75%-90%: Synthesis allowed if Pass 1 and 2 complete.
- 90%+: Enter synthesis immediately.

c) progress.md persistent state — survives context compaction and --continue restarts. Agent reads it on every session start.

d) Anti-rushing language (from Anthropic’s own agentic prompting guidance):

This is a long task. Use your full time budget. Do not stop early due to
token concerns — your context window will be compacted automatically.

3. Dead Agent Not Detected by Cron

Symptom: Cron restart script sees agent as alive when claude has exited. Agent sits idle indefinitely.

Root cause: run.sh ends with exec bash, keeping a shell alive in tmux. tmux has-session -t overnight returns true, so the restart script exits early.

Fix: After confirming tmux session exists, check if a claude process is actually running inside the pane:

if tmux has-session -t overnight 2>/dev/null; then
  if tmux list-panes -t overnight -F '#{pane_pid}' 2>/dev/null | while read pid; do
    pgrep -P "$pid" -f claude >/dev/null 2>&1 && exit 0
    for child in $(pgrep -P "$pid" 2>/dev/null); do
      pgrep -P "$child" -f claude >/dev/null 2>&1 && exit 0
    done
  done; then
    exit 0
  fi
  tmux kill-session -t overnight 2>/dev/null
fi

The grandchild check handles bash -> claude nesting.

4. Circuit Breaker Too Aggressive

Symptom: After 5 total restarts across an 8-hour session, the agent writes FAILED and stops permanently. Rate limits early in the session consume the entire restart budget.

Root cause: Lifetime max of 5 restarts. Over 8 hours, 5 is normal for transient issues.

Fix: Rolling 1-hour window, max 3 per hour. Critically: exit 0 (wait) instead of touch FAILED (give up):

MAX_PER_HOUR=3
RECENT_RESTARTS=0
ONE_HOUR_AGO=$(( $(date +%s) - 3600 ))
while IFS= read -r line; do
  LINE_DATE=$(echo "$line" | sed 's/: .*//')
  LINE_EPOCH=$(date -j -f "%a %b %d %T %Z %Y" "$LINE_DATE" +%s 2>/dev/null || echo 0)
  if [ "$LINE_EPOCH" -gt "$ONE_HOUR_AGO" ]; then
    RECENT_RESTARTS=$((RECENT_RESTARTS + 1))
  fi
done < <(grep "Restarting" "$WORKSPACE/restart.log")

if [ "$RECENT_RESTARTS" -ge "$MAX_PER_HOUR" ]; then
  exit 0  # Wait, don't give up
fi

5. `--continue` Without `-p` Fails in Pipes

Symptom: claude --continue --dangerously-skip-permissions 2>&1 | tee fails with Error: Input must be provided.

Root cause: --continue alone opens an interactive session. Piped to tee, stdin is not a TTY and there’s no prompt input.

Fix: Always pair --continue with -p and a resume prompt:

claude --continue -p "You were interrupted. Read progress.md and CLAUDE.md to see where you left off. Continue your research." --dangerously-skip-permissions --allowedTools 'Read,Write,Edit,Bash,Grep,Glob,WebSearch,WebFetch' 2>&1 | tee -a run.log

6. Unbound Variable After Rename

Symptom: Restart script silently fails under cron. No restarts happen.

Root cause: RESTART_COUNT renamed to RECENT_RESTARTS in the circuit breaker refactor, but one log line still referenced the old name. set -euo pipefail aborts on unbound variables.

Fix: Update the reference:

# Before (broken):
echo "$(date): Restarting (attempt $((RESTART_COUNT + 1)))"
# After:
echo "$(date): Restarting (attempt $((RECENT_RESTARTS + 1)))"

Prevention: Shell scripts have no compile-time checking. set -u catches this at runtime, but only on the code path that executes. grep for old variable names after any rename.

Prevention Checklist

Always test headless (-p) mode separately from interactive — skills, hooks, and session behavior differ
For long-running agents, add explicit time gates and anti-rushing language in the system prompt
Never rely solely on tmux session liveness — check the actual process inside the pane
Circuit breakers should rate-limit, not hard-cap — exit 0 to wait, never permanently fail on transient issues
After renaming variables in shell scripts, grep -r OLD_NAME scripts/ to find stale references
Test --continue with the same pipe/redirect setup you’ll use in production

Files Changed

.claude/skills/overnight/SKILL.md — multi-pass loop, time gates, progress tracking
scripts/overnight-launch.sh — embed skill instructions, anti-rushing prompt
scripts/overnight-restart.sh — process detection, rolling circuit breaker, variable fix
scripts/overnight-dashboard.py — stalled state detection, restart countdown