Anthropic and OpenAI push multi-agent workflows, turning users into AI supervisors
Anthropic’s Claude Opus 4.6 and OpenAI Frontier both move from chat to managed agent teams, with benchmarks rising but reliability and oversight still unresolved.

Key Takeaways
- Anthropic and OpenAI are steering users from single-bot chat to managing multiple agents that run tasks in parallel.
- OpenAI reports 77.3 percent on Terminal-Bench 2.0 for GPT-5.3-Codex, around 12 percentage points above Anthropic’s Opus 4.6.
- Anthropic is pushing long-context (up to 1 million tokens in beta) to support agent workflows across large codebases and documents.
- Market reaction has been volatile; Bloomberg tied agentic workflow fears to roughly 285 billion dollars wiped from software-adjacent stocks.
- For operators, the core shift is governance: permissions, memory, and review loops matter as much as prompting.
The newest enterprise push in AI isn’t “better chat,” it’s workload delegation: give multiple agents scoped tasks, run them in parallel, and have a human review, route, and correct the outputs.
Multi-agent tools shift work from prompting to oversight
Anthropic is pairing Claude Opus 4.6 with “agent teams” inside Claude Code, a research preview aimed at splitting read-heavy engineering work (like codebase reviews) across multiple subagents that coordinate concurrently. The workflow looks less like a chat window and more like a command center: developers can jump between subagents, take control of any thread, and let the rest keep running.
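The "command center" pattern can be sketched in ordinary code: several subagent tasks run concurrently, and a human can pull one thread out of the pool while the rest keep going. A minimal illustration with stubbed subagents standing in for real model calls; every name here is hypothetical, not Claude Code's actual API:

```python
import asyncio

# Stub subagent: in a real tool this would call a model API and
# stream back a review of its assigned slice of the codebase.
async def subagent(name: str, scope: str) -> str:
    await asyncio.sleep(0.01)  # stands in for model latency
    return f"{name}: reviewed {scope}"

async def main() -> list[str]:
    scopes = {
        "agent-a": "auth module",
        "agent-b": "billing module",
        "agent-c": "API layer",
    }
    tasks = {n: asyncio.create_task(subagent(n, s)) for n, s in scopes.items()}

    # The human "takes control" of one thread: cancel the agent's task
    # and handle that slice manually while the others keep running.
    tasks["agent-b"].cancel()

    results = []
    for name, task in tasks.items():
        try:
            results.append(await task)
        except asyncio.CancelledError:
            results.append(f"{name}: handed to human reviewer")
    return results

print(asyncio.run(main()))
```

The point of the sketch is the shape of the work, not the code itself: the user's attention moves from writing one prompt to routing and intervening across concurrent threads.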
OpenAI is making a similar bet with Frontier, positioned as an enterprise platform for “AI co-workers” that connect to business systems such as CRMs, ticketing tools, and data warehouses. OpenAI’s Barret Zoph described the goal as turning agents into “true AI co-workers” in comments to CNBC. For marketers and operators, the practical implication is governance: identity, permissions, and memory become first-class concepts, because the user’s job shifts to assigning tasks, checking execution, and preventing silent errors.
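What "permissions as a first-class concept" means in practice can be sketched as a deny-by-default scope check that runs before any agent action touches a business system. The names (`AgentIdentity`, `authorize`, the scope strings) are illustrative, not any vendor's API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentIdentity:
    """Each agent gets its own identity and an explicit set of scopes."""
    name: str
    scopes: set = field(default_factory=set)

# The full catalog of actions the platform recognizes (hypothetical).
KNOWN_SCOPES = {"crm.read", "crm.write", "tickets.read"}

def authorize(agent: AgentIdentity, action: str) -> bool:
    # Deny by default: the action must be a known scope AND one this
    # agent was explicitly granted.
    return action in KNOWN_SCOPES and action in agent.scopes

research_bot = AgentIdentity("research-bot", scopes={"crm.read"})
assert authorize(research_bot, "crm.read")
assert not authorize(research_bot, "crm.write")  # never granted write access
```

Deny-by-default matters here because the failure mode is silent: an over-scoped agent doesn't error, it quietly writes where it shouldn't.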
Benchmarks rise, but market and workflow risk are rising too
On the model side, Opus 4.6 adds a beta context window of up to 1 million tokens and posts a large jump on ARC-AGI-2 versus Opus 4.5, suggesting better “human-easy, model-hard” reasoning. Long context matters for multi-agent work because agents need to retrieve details across large codebases without losing the thread; Anthropic cites MRCR v2 results that improve sharply at the 1 million-token setting.
OpenAI, meanwhile, is tightening its agent stack with a Codex macOS “command center,” Git worktrees for isolated agent changes, and a new model, GPT-5.3-Codex. OpenAI says GPT-5.3-Codex hit 77.3 percent on Terminal-Bench 2.0, about 12 percentage points above Opus 4.6.
Investor anxiety is also part of the story. Bloomberg reported that agentic workflow releases helped erase roughly 285 billion dollars in market value across software-adjacent stocks, amid fears that model vendors could bundle end-to-end workflows that pressure SaaS incumbents.
The near-term takeaway for B2B teams: multi-agent setups can speed drafts and parallelize research, but they raise the stakes on guardrails, QA, and audit trails, because the human is now the manager of systems that still miss details.
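One minimal shape for that kind of guardrail is an audit log in which every agent output starts as pending and nothing ships until a named human has reviewed it. A hypothetical sketch, with all names illustrative:

```python
import time

# In-memory audit trail; a real system would persist this.
audit_log: list[dict] = []

def submit(agent: str, output: str) -> dict:
    """Agent output enters the queue as pending, never auto-applied."""
    entry = {
        "agent": agent,
        "output": output,
        "status": "pending_review",
        "ts": time.time(),
    }
    audit_log.append(entry)
    return entry

def review(entry: dict, approved: bool, reviewer: str) -> None:
    """A named human resolves each entry, leaving a record of who decided."""
    entry["status"] = "approved" if approved else "rejected"
    entry["reviewer"] = reviewer

draft = submit("copy-agent", "Q3 launch email draft")
review(draft, approved=False, reviewer="editor")  # human caught a missed detail
```

The record of who approved what, and when, is the audit trail: it turns "the human is the manager" from a slogan into something a compliance team can inspect.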