OpenAI's GPT-5.5 Outperforms Claude Fable 5 on New AI Professional Workflow Benchmark
Researchers from the University of California, Berkeley’s Center for Responsible, Decentralized Intelligence (RDI) and over 300 domain experts have launched Agents’ Last Exam (ALE), a new benchmark designed to assess AI’s capacity for economically valuable, long-horizon professional tasks. OpenAI’s GPT-5.5, operating via the Codex harness, secured the top position on the ALE Leaderboard with a 24.0% pass rate. This places it ahead of Anthropic's new Mythos-class Claude Fable 5 model, which ranked third with a 22.0% pass rate. That even the leading entry fails roughly three out of four tasks highlights how limited the most advanced AI models remain on real-world workflows.

Researchers from the University of California, Berkeley’s Center for Responsible, Decentralized Intelligence (RDI), in collaboration with over 300 domain experts, have introduced Agents’ Last Exam (ALE), a new benchmark for evaluating artificial intelligence. The test aims to measure whether AI can execute complex, economically valuable professional workflows, and is intended to close the gap between academic benchmarks and real-world labor impact.
OpenAI’s GPT-5.5, utilizing the Codex harness, achieved the highest score on the ALE Leaderboard, demonstrating a 24.0% pass rate. This performance exceeded Anthropic's recently released Claude Fable 5 model, which ranked third with a 22.0% pass rate. Other top-performing agentic harnesses included Ale Claw (GPT-5.5, 23.0%), OpenClaw (GPT-5.5, 21.1%), and Cursor CLI (Composer-2.5, 20.4%). Despite these leading scores, the overall pass rates remain low, with some configurations, such as Anthropic's older Claude Opus 4.8 and Google's Gemini CLI, recording a 0.0% pass rate on the most difficult 'Last-Exam' tier.
ALE distinguishes itself through its evaluation architecture, which enforces a Generalist Computer-Use Agent (GCUA) framework. This requires agents to navigate Linux or Windows virtual machines, performing both shell scripting and point-and-click operations within desktop software. The benchmark assesses capabilities across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). Crucially, ALE relies on deterministic, code-based evaluation for 93.2% of its workflows, minimizing the use of the subjective 'LLM-as-a-judge' paradigm.
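To make the distinction between code-based evaluation and an LLM judge concrete, the following is a minimal Python sketch of what a deterministic checker could look like. It is not ALE's actual evaluation code: the `TaskRun` structure, the invoice task, and the field names are all hypothetical and chosen only for illustration. The point is that the verdict comes from an exact comparison that yields the same result on every run.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class TaskRun:
    """One agent attempt at one task (hypothetical structure, not ALE's schema)."""
    task_id: str
    artifact: dict  # parsed output the agent left behind in the VM


def check_invoice(artifact: dict) -> bool:
    """Deterministic check: the reported total must equal the sum of line items."""
    items = artifact.get("line_items", [])
    expected = sum(item["qty"] * item["unit_price_cents"] for item in items)
    return bool(items) and artifact.get("total_cents") == expected


def pass_rate(runs: Iterable[TaskRun], checker: Callable[[dict], bool]) -> float:
    """Fraction of runs whose artifact passes the checker (0.0 when there are no runs)."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(checker(r.artifact) for r in runs) / len(runs)


# Illustrative data: the first run is correct, the second is off by one cent.
runs = [
    TaskRun("inv-001", {"line_items": [{"qty": 2, "unit_price_cents": 1250}], "total_cents": 2500}),
    TaskRun("inv-002", {"line_items": [{"qty": 1, "unit_price_cents": 999}], "total_cents": 1000}),
]
print(f"pass rate: {pass_rate(runs, check_invoice):.1%}")
```

A checker of this kind gives no partial credit for a near miss, which may be one reason pass rates on a benchmark built this way come out low.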
The benchmark launched with 1,490 task instances, with a target of 5,000 tasks. These tasks are designed for authenticity, drawing from the U.S. federal occupational taxonomy (O*NET / SOC 2018) and covering 55 non-physical industry sub-domains. Workflows are sourced from industry practitioners and involve specialized software applications like Siemens NX for 3D model creation, Unreal Engine for scene setup, FSLeyes for neuroimaging analysis, and Adobe After Effects for visual effects compositing. Tasks are categorized into three difficulty tiers: Near-Term, Full-Spectrum, and Last-Exam.
To prevent benchmark contamination, ALE implements a dual-use deployment strategy. Only about 10% of the dataset is publicly released, while the majority of tasks are kept private. Private tasks are regularly rotated into the public pool, and retired public tasks are swapped out, ensuring that evaluations remain uncontaminated across successive AI model generations. The leaderboard also offers transparency by tracking both 'Full' scores, which may include tasks requiring proprietary software, and 'Unlicensed' scores, based solely on freely available tools.
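The rotation scheme described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ALE team's procedure: the function name `rotate_pools`, the random selection of tasks, the one-for-one swap, and the choice to drop retired tasks from both pools are all hypothetical, since the article does not say how tasks are selected or what happens to them after retirement.

```python
import random


def rotate_pools(public: set, private: set, k: int, rng: random.Random):
    """Retire k public tasks and promote k private tasks into the public pool.

    Assumption: retired tasks leave both pools entirely.
    """
    k = min(k, len(public), len(private))
    # Sort before sampling so results are reproducible for a given seed.
    retired = set(rng.sample(sorted(public), k))
    promoted = set(rng.sample(sorted(private), k))
    new_public = (public - retired) | promoted
    new_private = private - promoted
    return new_public, new_private, retired


# Illustrative task IDs only.
public = {"t1", "t2", "t3"}
private = {"p1", "p2", "p3", "p4", "p5"}
new_public, new_private, retired = rotate_pools(public, private, 2, random.Random(0))
print(sorted(new_public), sorted(new_private), sorted(retired))
```

In this sketch the public pool keeps its size while its membership changes, so a model trained on an earlier public release has seen only part of any later one.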
Zengyi Qin, an MIT PhD researcher and project contributor, announced the launch, noting the participation of over 300 domain experts from more than 100 institutions. Project leads include Yiyou Sun, Xinyang Han, and Dawn Song. The low pass rates observed across the benchmark serve as a necessary reality check for the AI industry, indicating that even the most advanced models have significant room for improvement before they can fully integrate into complex professional workforces.
According to VentureBeat, this outcome aligns with recent analyses suggesting OpenAI's models are currently more adept at adhering to complex, multi-part prompts, a critical factor for success in ALE's demanding pipeline.


