Technology
Source: VentureBeat

OpenAI's GPT-5.5 Outperforms Claude Fable 5 on New AI Professional Workflow Benchmark

Researchers from the University of California, Berkeley’s Center for Responsible, Decentralized Intelligence (RDI) and over 300 domain experts have launched Agents’ Last Exam (ALE), a new benchmark designed to assess AI’s capacity for economically valuable, long-horizon professional tasks. OpenAI’s GPT-5.5, operating via the Codex harness, secured the top position on the ALE Leaderboard with a 24.0% pass rate. This places it ahead of Anthropic's new Mythos-class Claude Fable 5 model, which ranked third with a 22.0% pass rate. Scores this low show how limited even the most advanced AI models remain on real-world workflows.

By Fainaron · Jun 12, 2026

The Agents’ Last Exam (ALE), a new benchmark for evaluating artificial intelligence, has been introduced by researchers from the University of California, Berkeley’s Center for Responsible, Decentralized Intelligence (RDI), in collaboration with over 300 domain experts. This rigorous test aims to measure whether AI can effectively execute complex, economically valuable professional workflows, addressing the gap between academic benchmarks and real-world labor impact.

OpenAI’s GPT-5.5, running on the Codex harness, achieved the highest score on the ALE Leaderboard with a 24.0% pass rate. That exceeded Anthropic's recently released Claude Fable 5 model, which ranked third with a 22.0% pass rate. Other top-performing agentic harnesses included Ale Claw (GPT-5.5, 23.0%), OpenClaw (GPT-5.5, 21.1%), and Cursor CLI (Composer-2.5, 20.4%). Even these leading pass rates are low, and some configurations, such as Anthropic's older Claude Opus 4.8 and Google's Gemini CLI, recorded a 0.0% pass rate on the most difficult 'Last-Exam' tier.

ALE distinguishes itself through its evaluation architecture, which enforces a Generalist Computer-Use Agent (GCUA) framework. This requires agents to navigate Linux or Windows virtual machines, performing both shell scripting and point-and-click operations within desktop software. The benchmark assesses capabilities across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). Crucially, ALE relies on deterministic, code-based evaluation for 93.2% of its workflows, minimizing the use of the subjective 'LLM-as-a-judge' paradigm.
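To make the contrast with 'LLM-as-a-judge' concrete, here is a minimal Python sketch of what deterministic, code-based evaluation can look like. It is not ALE's actual checker code: the `Workspace` type, the file names, the expected values, and both checker functions are hypothetical, chosen only to show that each verdict comes from a fixed rule over the agent's outputs rather than from another model's opinion.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the agent's final VM state:
# a mapping from output file path to that file's text content.
Workspace = Dict[str, str]
Checker = Callable[[Workspace], bool]


def check_csv_has_header(ws: Workspace) -> bool:
    """Pass only if report.csv exists and its first row is the expected header."""
    lines = ws.get("report.csv", "").splitlines()
    return bool(lines) and lines[0] == "id,total"


def check_total_is_correct(ws: Workspace) -> bool:
    """Pass only if summary.txt contains exactly the expected total."""
    return ws.get("summary.txt", "").strip() == "total=42"


def pass_rate(results: List[bool]) -> float:
    """Percentage of checks that passed; 0.0 for an empty result list."""
    return 100.0 * sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    # Illustrative workspace: the header is right, the total is wrong.
    ws = {"report.csv": "id,total\n1,42\n", "summary.txt": "total=41\n"}
    checkers: List[Checker] = [check_csv_has_header, check_total_is_correct]
    results = [check(ws) for check in checkers]
    print(results, pass_rate(results))
```

Because the checks are ordinary functions, running them twice on the same outputs always yields the same verdict, which is the property the benchmark's designers are after.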

The benchmark launched with 1,490 task instances, with a target of 5,000 tasks. These tasks are designed for authenticity, drawing from the U.S. federal occupational taxonomy (O*NET / SOC 2018) and covering 55 non-physical industry sub-domains. Workflows are sourced from industry practitioners and involve specialized software applications like Siemens NX for 3D model creation, Unreal Engine for scene setup, FSLeyes for neuroimaging analysis, and Adobe After Effects for visual effects compositing. Tasks are categorized into three difficulty tiers: Near-Term, Full-Spectrum, and Last-Exam.

To prevent benchmark contamination, ALE implements a dual-use deployment strategy. Only about 10% of the dataset is publicly released, while the majority of tasks are kept private. Private tasks are regularly rotated into the public pool, and retired public tasks are swapped out, ensuring that evaluations remain uncontaminated across successive AI model generations. The leaderboard also offers transparency by tracking both 'Full' scores, which may include tasks requiring proprietary software, and 'Unlicensed' scores, based solely on freely available tools.
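The rotation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ALE's published procedure: the function name `rotate_tasks`, the task identifiers, the pool sizes, and the choice to pick tasks at random are all hypothetical.

```python
import random
from typing import List, Tuple


def rotate_tasks(
    public: List[str], private: List[str], n: int, rng: random.Random
) -> Tuple[List[str], List[str], List[str]]:
    """Retire n public tasks and promote n private tasks into the public pool.

    Returns (new_public, new_private, retired). Random selection is an
    assumption made for this sketch.
    """
    n = min(n, len(public), len(private))
    retired = rng.sample(public, n)
    promoted = rng.sample(private, n)
    retired_set, promoted_set = set(retired), set(promoted)
    new_public = [t for t in public if t not in retired_set] + promoted
    new_private = [t for t in private if t not in promoted_set]
    return new_public, new_private, retired


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    # Illustrative pools: roughly 10% of tasks are public.
    public = [f"pub-{i}" for i in range(10)]
    private = [f"priv-{i}" for i in range(90)]
    new_public, new_private, retired = rotate_tasks(public, private, 3, rng)
    print(len(new_public), len(new_private), len(retired))
```

In this sketch the public pool keeps its size while its membership changes, so a model trained on an earlier public release has not seen every task it is later scored on.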

Zengyi Qin, an MIT PhD researcher and project contributor, announced the launch, noting the participation of over 300 domain experts from more than 100 institutions. Project leads include Yiyou Sun, Xinyang Han, and Dawn Song. The low pass rates observed across the benchmark serve as a necessary reality check for the AI industry, indicating that even the most advanced models have significant room for improvement before they can fully integrate into complex professional workforces.

According to VentureBeat, this outcome aligns with recent analyses suggesting OpenAI's models are currently more adept at adhering to complex, multi-part prompts, a critical factor for success in ALE's demanding pipeline.

Source attribution: This article was AI-curated and rewritten by Fainaron from a piece originally published by VentureBeat.
