Crosby Launches Redline Bench to Evaluate AI Models in Contract Review
Crosby, a tech-driven law firm, has introduced Redline Bench, a new benchmark designed to assess how well artificial intelligence models perform a real-world legal task: contract review. The tool aims to help lawyers judge the trustworthiness and quality of AI-generated legal work, addressing the inherent ambiguity in defining 'good' or 'bad' legal outcomes. Redline Bench was developed by Crosby's Intelligence unit, using a methodology in which senior lawyers simulate software deals and turn the most important contract changes into weighted criteria. Initial tests using the benchmark placed ChatGPT 5.5 at the top with a score of 50.5%, followed by Gemini 3.5 Flash and Claude Opus.
The benchmark measures how effectively AI models handle legal contract negotiations. The initiative aims to give lawyers a standardized method for evaluating whether they can rely on AI technology for complex legal work.
Billions of dollars ride on the promise that AI will absorb routine legal tasks, but defining the quality of AI's legal output has been a challenge. Ryan Daniels, a former in-house lawyer and Crosby founder, highlighted that unlike software coding, where it is clear whether the code works, legal work can be subjective. A single contract edit, or 'redline,' might be viewed differently by different legal professionals.
To tackle this ambiguity, Crosby formed its Intelligence unit, which includes engineers such as Sharan Ramjee, known for work on transformer models at Stripe, and lawyers such as Ross Weiser, formerly of Sullivan & Cromwell. This team developed Redline Bench. Crosby also collaborated with Micro1, a company that helps AI model-makers recruit expert workers, to refine the criteria for 'good' legal work.
The benchmark's development involved senior lawyers simulating software deals and identifying the most crucial contract changes at each negotiation stage. These changes were then converted into weighted criteria. During testing, AI models are provided with the same contracts and tasked with making their own edits. A panel of three judges subsequently compares these AI-generated redlines against the lawyer-built rubric, voting pass or fail on each item to generate a final score.
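The scoring procedure described above can be illustrated with a minimal sketch. The article does not say how the three judges' votes are combined or how the weights enter the final score, so this sketch assumes a simple majority vote per rubric item and a score equal to the share of total weight on passed items. The names `Criterion`, `criterion_passed`, and `rubric_score`, along with the sample criteria and weights, are hypothetical and are not taken from Crosby's benchmark.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item derived from a lawyer-prioritized contract change (hypothetical)."""
    description: str
    weight: float                    # relative importance assigned by the lawyers
    votes: tuple[bool, bool, bool]   # pass/fail vote from each of the three judges


def criterion_passed(votes: tuple[bool, ...]) -> bool:
    # Assumption: an item passes when a strict majority of judges vote "pass".
    return sum(votes) * 2 > len(votes)


def rubric_score(criteria: list[Criterion]) -> float:
    # Assumption: final score = weight of passed items / total weight, as a percentage.
    total_weight = sum(c.weight for c in criteria)
    if total_weight == 0:
        raise ValueError("rubric has no weighted criteria")
    earned = sum(c.weight for c in criteria if criterion_passed(c.votes))
    return 100.0 * earned / total_weight


# Made-up rubric for illustration only; not real Redline Bench data.
example = [
    Criterion("Cap liability at fees paid", 3.0, (True, True, False)),
    Criterion("Narrow the indemnification clause", 2.0, (False, False, True)),
    Criterion("Add a mutual termination right", 1.0, (True, True, True)),
]

print(f"{rubric_score(example):.1f}%")  # prints 66.7%
```

Under these assumptions, a model that misses a heavily weighted edit loses more than one that misses a minor edit, which is one plausible reason to weight the criteria rather than count them equally.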
Crosby plans to make Redline Bench publicly accessible, allowing any lab to test its models. The company also intends to regularly publish reports detailing how major AI models compare. Initial findings from Redline Bench showed ChatGPT 5.5 leading with a score of 50.5%, indicating its redlines matched roughly half of the lawyers' prioritized edits. Gemini 3.5 Flash scored 45.1%, and Claude Opus scored 44.4%. An early, limited test of Anthropic's Fable 5 showed promising results at 47.3% before the model was withdrawn.
According to Business Insider, Crosby isn't.

