
Multi-Agent Workflow — Claude Code + Codex CLI Collaboration

Patterns that break when you move from one AI agent to two — role split, handoff document, verification gates, real failure cases.

The first thing that breaks when a one-agent user starts combining two or more is the assumption that they share context. Humans paper over gaps with Slack, meetings, and sitting next to each other. Agents can't. Delegation has to be expressed not as a promise but as a gate enforced in code.

This guide collects patterns from a week or more of running Claude Code (main agent) + Codex CLI (sub agent). The same patterns apply to other combinations (Cursor + Claude Code, Aider + Claude Code, etc.).

TL;DR

  1. Role split — analysis / planning / instructions go to Opus (main); execution goes to Sonnet / Codex (sub). Same task on both diverges.
  2. Write a handoff document — not a Slack message. A self-contained document another agent can read and finish the work from.
  3. Bundle verification gates as code — npm run verify:{phase} scripts + CI checks. Without gates, gaps accumulate.
  4. One task per agent — don't blindly trust sub results. Main re-runs before merge.

Why Multi-Agent

One agent is enough for many tasks: single-file edits, short refactors, code reviews. Multi-agent wins in these cases:

  • Parallelizable independent work — frontend + backend + migration concurrently
  • Model-aligned split — reasoning (Opus) and execution (Sonnet/GPT) split for cost/speed
  • Context isolation — protect the main agent's context in big codebases
  • Independent verification — author and verifier separation gives objectivity

It's a loss when:

  • The work converges on a single file — merge conflicts eat the collaboration gain
  • The context is very small — handoff doc takes longer than the work
  • Spec is ambiguous — two agents will diverge on different interpretations

1. Role Split — Who Does What

The common mistake is "have both do the same job." Results diverge and no one owns it.

Recommended split:

  • Main (Architect) — Claude Opus (Claude Code) — analysis · plan · instructions · handoff review · merge decisions · documentation
  • Sub (Builder) — Claude Sonnet subagent or Codex CLI — CLI runs · build · test · repetitive tasks · Explore (file search)
  • Reviewer — (optional) separate Sonnet instance or human — reproduce results · diff review · gate confirmation

Even if one person runs both, split sessions. If main verifies its own results with its own context, objectivity is lost.

What Only Main Does

  • Plan + tradeoff evaluation + alternative comparison
  • Document the "why" behind decisions
  • Write the handoff document
  • Verify sub's results before merge (re-run, inspect gate raw output)

What Only Sub Does

  • Clearly defined tasks — input/output/acceptance pinned in the handoff
  • Repetitive CLI work (build, test, migration runs)
  • Big codebase grep/find — protects main's context

Never Have Both Do

  • "Figure out how to do this and tell me" — two different answers to reconcile
  • "Write the code and verify it too" — self-review misses blind spots

2. Handoff Document Pattern

A handoff is a self-contained document another agent can use to finish the work from. Not a Slack message or short prompt.

Required Sections

# Phase X — {Title}
 
## Goal
{One sentence: what gets finished}
 
## Context
- Related files: `src/foo.ts:42`, `tests/foo.test.ts`
- Prior work: PR #123 merged, RFC-005 adopted
- External dep: API spec `docs/api.md` v2
 
## Scope
1. Change signature of {function B} in {file A}
2. Add new function D in {file C}
3. Update four tests in {file E}
 
## Non-goals
- Don't {tempting extension} (next phase work)
- Don't {refactor temptation} (separate PR)
 
## Acceptance
- [ ] `npm run verify:phase-X` passes
- [ ] `npm run typecheck` clean
- [ ] PR diff < 300 lines (split smaller if larger)
 
## Verification commands
```bash
npm run verify:phase-X  # phase-specific gate
npm test src/foo        # unit tests
```
 
## When stuck
- If you hit the same error 3 times, report back (no infinite self-loop)
- If external API returns differently than expected, report (don't fake responses)
 
## Author / Date
csmarch · 2026-05-12

Core Principles

  1. Self-contained — work must be possible from the handoff alone. "We talked about it on Slack" is invalid
  2. State scope AND non-scope — blocks well-intentioned scope creep
  3. Acceptance as code — replace "if it works, OK" with concrete npm run verify:phase-X gates
  4. Stuck-reporting rule — explicit limit on self-debugging time

Handoff length scales with task size. A 100-line handoff for a one-line task is a loss. 5+ files or a new module → handoff required; 1–2 file bug fix → one chat line OK.

3. Verification Gates — Promises as Code

The most common and most expensive mistake: "We'll verify after merge."

When sub reports done, main trusts it and moves to the next phase. If the next phase breaks something, you trace back. Gaps accumulate and eventually explode as a big catch-up.

Principle: bundle the verification script and merge gate in the same PR as the handoff.

Three Gates

  1. Local verification script — npm run verify:phase-X or npm run verify:auth, one job per task
  2. CI gate — GitHub Actions / GitLab CI runs verify:phase-X automatically. Merge blocked on failure
  3. Main re-run — main reproduces sub's result in its own environment. Same result → merge

Example — verify:phase-X Script

// package.json
{
  "scripts": {
    "verify:phase-A": "tsc --noEmit && eslint src/phase-a && jest src/phase-a --coverage",
    "verify:phase-B": "...",
    "verify:assets": "node scripts/check-asset-sha256.mjs"
  }
}
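The verify:assets entry above points at a Node script that isn't reproduced here. As a hedged sketch of the same gate in shell (the sandbox setup, paths, and file names below are illustrative, not the real script):

```bash
#!/usr/bin/env bash
# Sketch of an asset-checksum gate: verify bundled assets against a
# committed manifest, failing the build if anything drifts.
set -euo pipefail
workdir=$(mktemp -d) && cd "$workdir"        # demo sandbox for illustration
mkdir -p assets && printf 'demo asset\n' > assets/logo.txt
sha256sum assets/logo.txt > assets.sha256    # generate manifest (commit this)
sha256sum --check --strict assets.sha256     # gate: non-zero exit on drift
```

In a real repo the manifest would already be committed, and only the `--check` line would run in CI.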

GitHub Actions Gate Example

# .github/workflows/ci.yml
name: CI
on: [pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v3
      - run: pnpm install --frozen-lockfile
      - run: pnpm lint
      - run: pnpm build
      - run: pnpm verify:assets   # phase verify follows the same pattern

Sub's Completion Report Format

When sub reports a phase done, attach:

[Phase X complete]
- Changed files: 5 (src/foo.ts, ...)
- verify raw output:
  $ npm run verify:phase-X
  ✓ tsc clean
  ✓ eslint clean
  ✓ jest 23 passed
- PR: #234

Reports without raw output → ask again. Don't trust "it works" text.

4. One Task Per Agent — Don't Blindly Trust Results

Don't move on as soon as sub reports done. Main does three things:

  1. Re-run — instead of trusting sub's raw verify output, main runs the same commands locally. If results differ, investigate (env diff? race? cache?)
  2. Diff review — review the PR diff at human review depth. Watch for out-of-scope changes, missing comments, absent tests
  3. Try counter-examples — beyond acceptance, run 1–2 "this had better not break" cases yourself
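The re-run step above can be sketched as a small helper that diffs main's local output against the raw log sub attached to its report (the helper and file names are assumptions for illustration):

```bash
# Sketch: main re-runs the same verify command and compares the output
# with the log sub reported. A mismatch means env diff, race, or cache.
run_and_compare() {
  local verify_cmd="$1" sub_log="$2"
  eval "$verify_cmd" > main-run.log 2>&1 || true   # re-run locally
  if diff -q main-run.log "$sub_log" > /dev/null; then
    echo "match: safe to merge"
  else
    echo "divergence: investigate env diff, race, or cache"
  fi
}
```

Usage would look like `run_and_compare 'npm run verify:phase-X' sub-report.log`.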

This isn't doubting sub's capability — agents fundamentally can't see their own context's blind spots. People can't review their own code either.

5. Lessons From Real Failures

Case — handoff existed, gate didn't

Situation: UI redesign phase D delegated. Handoff listed 8 acceptance items. Sub reported "all done." Main merged. Phase E began.

Problem: Acceptance #4 of phase D was not actually passing (sub judged a partial implementation as OK). Phase E used phase D's feature and broke. Debugging phase E required going back to phase D. Same pattern repeated in F.

Recovery: A separate phase to catch up D–F, with verify:phase-D scripts written retroactively. Had the gate existed up front, it would have failed first.

Lesson: Acceptance without a verification script floats away. Bundle handoff + verify script + CI gate in the same PR.

Case — Self Infinite Loop

Situation: Sub asked to "fix breaking tests." Sub tried 5 different things on its own code, each producing a new error. After an hour: "Tried various things, doesn't work."

Problem: No stuck-reporting rule. Sub had no criterion for when to ask for help.

Recovery: Added "report immediately after 3 same-error attempts" to the handoff template. Main provided a different hypothesis with broader context.

Lesson: Stipulate stuck-rules in the handoff. "Keep trying" → "After 3 failures, report."
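The stuck-rule itself can be expressed as code rather than prose. A hedged sketch, assuming the sub's fix attempt is wrapped in a retry loop (the wrapper and its limits are illustrative):

```bash
# Sketch: stop after 3 identical failures (plus a hard attempt cap)
# instead of self-looping forever, and surface a report to main.
bounded_retry() {
  local cmd="$1" prev="" same=0 attempt=0
  while (( attempt < 10 )); do             # hard cap even if errors keep changing
    attempt=$((attempt + 1))
    local out
    if out=$(eval "$cmd" 2>&1); then
      echo "passed"; return 0
    fi
    if [[ "$out" == "$prev" ]]; then same=$((same + 1)); else same=1; prev="$out"; fi
    if (( same >= 3 )); then
      echo "stuck: same error 3 times, report to main"; return 1
    fi
  done
  echo "stuck: attempt cap reached, report to main"; return 1
}
```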

6. Onboarding Checklist

First week moving to multi-agent:

  1. Define the main agent's persona/rules — CLAUDE.md or equivalent: role, autonomy, prohibitions, code style
  2. Start with one sub — Codex CLI or one Sonnet subagent. Not 3–4 at once
  3. Keep first handoffs short — 1–2 file work, 2–3 acceptance items, one verify script
  4. Register one verify gate in GitHub Actions — add more as phases multiply
  5. Apply the stuck-rule — report after 3 failures + main reviews
  6. Retro after each phase — one-line note on what was missing. Handoff template evolves over time
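For step 1, a minimal persona file might look like the sketch below. Every section name and rule here is an assumption for illustration, not a Claude Code requirement:

```markdown
# CLAUDE.md — main agent persona (illustrative sketch)

## Role
Architect: analysis, planning, handoff writing, merge decisions.

## Autonomy
- May: read any file, run read-only commands, draft handoffs
- Must ask: before force-push, schema migrations, or file deletion

## Prohibitions
- Never merge a sub PR without re-running its verify command
- Never mark an acceptance item done without raw command output

## Code style
- TypeScript strict mode; no new `any`
```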

Troubleshooting

"The handoff is so long I lose motivation"

The task is too big: slice the phase smaller. A handoff past 100 lines usually means the underlying work spans more than a week.

"Sub keeps doing out-of-scope work"

Missing or vague "Non-goals" section. Instead of "no refactoring," write "do not touch src/legacy/" — concretely.

"No time to write verify scripts"

Write the verify script when the handoff is written, not after the code. Each acceptance item maps to an assertion in verify:phase-X.
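One way to keep that mapping mechanical is to drive verify:phase-X from a named list of checks, one per acceptance item. The runner below and its check names are illustrative, not a prescribed tool:

```bash
# Sketch: read "name|command" pairs from stdin, run each, report
# pass/fail per acceptance item, and exit non-zero if any failed.
run_checks() {
  local failed=0 name cmd
  while IFS='|' read -r name cmd; do
    if eval "$cmd" > /dev/null 2>&1; then
      echo "PASS $name"
    else
      echo "FAIL $name"; failed=1
    fi
  done
  return $failed
}

# Example wiring (commands assumed), one line per acceptance item:
# printf 'typecheck|npm run typecheck\nunit tests|npm test src/foo\n' | run_checks
```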

"Two agents keep conflicting"

Context isolation failed. Their scopes overlap. The handoff's "scope — file list" must be cleanly separable. Two people editing the same file conflict too.

"Main becomes a bottleneck"

The answer isn't giving sub more authority. Main retains verification + merge decisions. Instead, invest in handoff/gate automation and reduce the amount of work (smaller scope, simpler plan).

Changelog

  • 2026-05-12: First draft. A week+ of operating experience as six sections + five troubleshooting cases.
