Headless Claude Agent: Multi-Pass Research Loop and Process Lifecycle Fixes
Six bugs discovered and fixed during the first real overnight agent run. All relate to running Claude Code in headless mode (claude -p) for long autonomous sessions.
1. SKILL.md Not Loaded in Headless Mode
Symptom: Agent answers the research question in one shot (3 minutes) instead of following the exploration loop.
Root cause: claude -p "$TASK" passes a one-shot prompt. The skill system (SKILL.md with YAML frontmatter) is not activated in headless mode. The agent never sees the behavioral instructions.
Fix: Strip YAML frontmatter from SKILL.md with awk and embed the body into the workspace CLAUDE.md, which Claude Code auto-reads on startup.
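# Extract skill body: strip the leading YAML frontmatter, keep any later '---' lines in the body
SKILL_BODY=$(awk 'fm < 2 && /^---$/ {fm++; next} fm >= 2 {print}' "$SKILL_FILE")
# Write the task and skill body into the workspace CLAUDE.md (overwrites any existing file)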
cat > "$WORKSPACE/CLAUDE.md" << CLAUDEEOF
# Task
$TASK
...
---
$SKILL_BODY
CLAUDEEOF
The run.sh prompt changed from the bare task to:
claude -p "You are an overnight research agent. Read CLAUDE.md in this directory for your task, behavioral instructions, end time, and branch budget. Follow the exploration loop exactly as described. Start by reading CLAUDE.md now."
Gotcha: BSD sed (macOS) can’t strip YAML frontmatter reliably — use awk instead.
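Before an overnight run, it helps to confirm that frontmatter stripping yields exactly the skill body. The sketch below does that against a throwaway file. The helper name strip_frontmatter and the sample frontmatter key are illustrative assumptions, not part of Claude Code or the original scripts. This variant treats only the first two `---` lines as delimiters, so a horizontal rule inside the body survives.

```shell
#!/usr/bin/env bash
# Sketch: strip a leading YAML frontmatter block and print only the body.
# strip_frontmatter is a hypothetical helper name.
strip_frontmatter() {
  # Only the first two '---' lines count as delimiters, so a later '---'
  # (for example a Markdown horizontal rule) stays in the body.
  awk 'fm < 2 && /^---$/ { fm++; next } fm >= 2 { print }' "$1"
}

# Illustrative input; the frontmatter key here is made up.
demo=$(mktemp)
printf '%s\n' '---' 'name: overnight' '---' '# Loop' '---' 'Pass 1' > "$demo"
body=$(strip_frontmatter "$demo")
rm -f "$demo"

# Prints the three body lines: "# Loop", "---", "Pass 1"
printf '%s\n' "$body"
```

Note that a file with no frontmatter at all produces an empty body with this approach, so the launch script may want to check that the result is non-empty.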
2. Agent Rushing to Synthesis
Symptom: 6-hour session completes in 10 minutes with shallow findings.
Root cause: Without explicit pacing, Claude optimizes for completion. It treats research as a task to finish, not a time budget to fill.
Fix: Four mechanisms in SKILL.md:
a) Multi-pass loop — Pass 1 surveys broadly, Pass 2 researches its own logged questions, Pass 3+ deepens with primary sources and contradicting evidence. Each pass must complete before synthesis.
b) Hard time gates:
- Before 75% of time used: Do NOT enter synthesis mode.
- 75%-90%: Synthesis allowed if Pass 1 and 2 complete.
- 90%+: Enter synthesis immediately.
c) progress.md persistent state — survives context compaction and --continue restarts. Agent reads it on every session start.
d) Anti-rushing language (from Anthropic’s own agentic prompting guidance):
This is a long task. Use your full time budget. Do not stop early due to
token concerns — your context window will be compacted automatically.
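The time gates in (b) reduce to a small amount of arithmetic. In the original setup they are enforced as prose instructions in SKILL.md; the sketch below is only one way an agent or wrapper script could compute which gate applies. The function name time_gate, its arguments, and its output labels are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the time gates as a shell function. All names are hypothetical.
# Args: start epoch, end epoch, current epoch, number of completed passes.
# Prints one of: explore | synthesis-allowed | synthesize-now
time_gate() {
  local start=$1 end=$2 now=$3 passes_done=$4
  local total=$(( end - start ))
  if [ "$total" -le 0 ]; then
    echo "synthesize-now"   # no budget left or bad input: wrap up
    return
  fi
  # Integer percentage of the time budget already used (rounded down).
  local pct=$(( (now - start) * 100 / total ))
  if [ "$pct" -ge 90 ]; then
    echo "synthesize-now"
  elif [ "$pct" -ge 75 ] && [ "$passes_done" -ge 2 ]; then
    echo "synthesis-allowed"
  else
    echo "explore"
  fi
}

# A 6-hour budget (21600 s) starting at epoch 0:
time_gate 0 21600 3600 2    # 16% used
time_gate 0 21600 17280 2   # 80% used, passes 1 and 2 done
time_gate 0 21600 17280 1   # 80% used, pass 2 unfinished
time_gate 0 21600 20000 0   # 92% used
```

The four calls print explore, synthesis-allowed, explore, and synthesize-now in that order. Rounding the percentage down errs on the side of more exploration near each boundary.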
3. Dead Agent Not Detected by Cron
Symptom: Cron restart script sees agent as alive when claude has exited. Agent sits idle indefinitely.
Root cause: run.sh ends with exec bash, keeping a shell alive in tmux. tmux has-session -t overnight returns true, so the restart script exits early.
Fix: After confirming tmux session exists, check if a claude process is actually running inside the pane:
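if tmux has-session -t overnight 2>/dev/null; then
  # Look for a claude process among each pane's children and grandchildren.
  if tmux list-panes -t overnight -F '#{pane_pid}' 2>/dev/null | {
    while read -r pid; do
      pgrep -P "$pid" -f claude >/dev/null 2>&1 && exit 0
      for child in $(pgrep -P "$pid" 2>/dev/null); do
        pgrep -P "$child" -f claude >/dev/null 2>&1 && exit 0
      done
    done
    # Explicit failure: a pane with no children would otherwise leave the loop with status 0.
    exit 1
  }; then
    exit 0  # agent is alive; nothing to do
  fi
  # Session exists but no claude process is running: clear it so a restart can proceed.
  tmux kill-session -t overnight 2>/dev/null
fi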
The grandchild check handles bash -> claude nesting.
4. Circuit Breaker Too Aggressive
Symptom: After 5 total restarts across an 8-hour session, the agent writes FAILED and stops permanently. Rate limits early in the session consume the entire restart budget.
Root cause: Lifetime max of 5 restarts. Over 8 hours, 5 is normal for transient issues.
Fix: Rolling 1-hour window, max 3 per hour. Critically: exit 0 (wait) instead of touch FAILED (give up):
MAX_PER_HOUR=3
RECENT_RESTARTS=0
ONE_HOUR_AGO=$(( $(date +%s) - 3600 ))
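while IFS= read -r line; do
  # Log lines look like "<date output>: Restarting ..."; keep only the date part.
  LINE_DATE=$(echo "$line" | sed 's/: .*//')
  # date -j -f is BSD/macOS syntax; unparseable dates become epoch 0 and fall outside the window.
  LINE_EPOCH=$(date -j -f "%a %b %d %T %Z %Y" "$LINE_DATE" +%s 2>/dev/null || echo 0)
  if [ "$LINE_EPOCH" -gt "$ONE_HOUR_AGO" ]; then
    RECENT_RESTARTS=$((RECENT_RESTARTS + 1))
  fi
done < <(grep "Restarting" "$WORKSPACE/restart.log")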
if [ "$RECENT_RESTARTS" -ge "$MAX_PER_HOUR" ]; then
exit 0 # Wait, don't give up
fi
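The date parsing above depends on BSD date. If the log format can be changed, a simpler and more portable variant is to prefix each log line with epoch seconds and count with awk. The sketch below assumes that hypothetical log format ("<epoch> Restarting ..."), which differs from the format the original script writes; count_recent_restarts is likewise an illustrative name.

```shell
#!/usr/bin/env bash
# Sketch: count restarts inside a rolling window from epoch-prefixed log lines.
# Assumed (hypothetical) line format: "<epoch> Restarting ..."
# Args: log file, current epoch, window in seconds (default 3600).
count_recent_restarts() {
  local log=$1 now=$2 window=${3:-3600}
  # A missing log means no restarts have happened yet.
  [ -f "$log" ] || { echo 0; return; }
  awk -v cutoff="$(( now - window ))" \
    '/Restarting/ && $1 + 0 > cutoff { n++ } END { print n + 0 }' "$log"
}

# Demo with fixed timestamps so the result does not depend on the clock.
log=$(mktemp)
printf '%s\n' '1000 Restarting (attempt 1)' \
              '5000 Restarting (attempt 2)' \
              '5200 Agent alive' \
              '5500 Restarting (attempt 3)' > "$log"
recent=$(count_recent_restarts "$log" 6000)
rm -f "$log"
echo "$recent"   # prints 2: only the restarts at 5000 and 5500 are after the cutoff of 2400
```

In a real restart script the second argument would be $(date +%s), and the result would be compared against MAX_PER_HOUR exactly as in the loop above.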
5. --continue Without -p Fails in Pipes
Symptom: claude --continue --dangerously-skip-permissions 2>&1 | tee fails with Error: Input must be provided.
Root cause: --continue alone opens an interactive session. Piped to tee, stdin is not a TTY and there’s no prompt input.
Fix: Always pair --continue with -p and a resume prompt:
claude --continue -p "You were interrupted. Read progress.md and CLAUDE.md to see where you left off. Continue your research." --dangerously-skip-permissions --allowedTools 'Read,Write,Edit,Bash,Grep,Glob,WebSearch,WebFetch' 2>&1 | tee -a run.log
6. Unbound Variable After Rename
Symptom: Restart script silently fails under cron. No restarts happen.
Root cause: RESTART_COUNT renamed to RECENT_RESTARTS in the circuit breaker refactor, but one log line still referenced the old name. set -euo pipefail aborts on unbound variables.
Fix: Update the reference:
# Before (broken):
echo "$(date): Restarting (attempt $((RESTART_COUNT + 1)))"
# After:
echo "$(date): Restarting (attempt $((RECENT_RESTARTS + 1)))"
Prevention: Shell scripts have no compile-time checking. set -u catches this at runtime, but only on the code path that executes. grep for old variable names after any rename.
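The grep step can be turned into a check that fails loudly, for example as a pre-commit step. This is a minimal sketch; find_stale_refs is a hypothetical helper name and the demo files are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: fail if an old variable name is still referenced after a rename.
# find_stale_refs is a hypothetical helper; pass the old name and a directory.
find_stale_refs() {
  local old_name=$1 dir=$2
  # -r recurse, -n show line numbers, -w match whole words only
  if grep -rnw -- "$old_name" "$dir"; then
    return 1   # stale references found (grep printed them)
  fi
  return 0
}

# Demo: one script still uses RESTART_COUNT after the rename.
demo=$(mktemp -d)
printf '%s\n' 'echo "attempt $((RESTART_COUNT + 1))"' > "$demo/restart.sh"
printf '%s\n' 'RECENT_RESTARTS=0' > "$demo/launch.sh"

if find_stale_refs RESTART_COUNT "$demo" >/dev/null; then status=clean; else status=stale; fi
if find_stale_refs NO_SUCH_NAME "$demo" >/dev/null; then status2=clean; else status2=stale; fi
rm -rf "$demo"
echo "$status $status2"   # prints "stale clean"
```

Whole-word matching (-w) avoids false positives when the old name is a substring of a longer identifier.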
Prevention Checklist
- Always test headless (-p) mode separately from interactive: skills, hooks, and session behavior differ
- For long-running agents, add explicit time gates and anti-rushing language in the system prompt
- Never rely solely on tmux session liveness; check the actual process inside the pane
- Circuit breakers should rate-limit, not hard-cap: exit 0 to wait, never permanently fail on transient issues
- After renaming variables in shell scripts, grep -r OLD_NAME scripts/ to find stale references
- Test --continue with the same pipe/redirect setup you’ll use in production
Files Changed
- .claude/skills/overnight/SKILL.md: multi-pass loop, time gates, progress tracking
- scripts/overnight-launch.sh: embed skill instructions, anti-rushing prompt
- scripts/overnight-restart.sh: process detection, rolling circuit breaker, variable fix
- scripts/overnight-dashboard.py: stalled state detection, restart countdown