OpenAI announced that it is no longer evaluating its models on SWE-bench Verified [1].

This shift reflects a growing challenge in the artificial intelligence industry: benchmarks often fail to keep pace with the rapid evolution of frontier models. As AI coding capabilities advance, the metrics used to measure them can become obsolete—rendering them ineffective for guiding future development.

OpenAI stated, "We are no longer evaluating SWE-bench Verified" [1], explaining that the benchmark is no longer suitable for measuring frontier coding capabilities [1].

Software engineering benchmarks like SWE-bench are designed to test a model's ability to resolve real-world GitHub issues. However, when models reach a certain threshold of proficiency, the specific constraints or patterns of a particular benchmark may no longer provide the granular data needed to identify remaining weaknesses.
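To make the scoring model concrete: SWE-bench-style evaluations are typically graded by applying a model-generated patch to the affected repository and re-running the tests tied to the original issue. The sketch below is an illustrative simplification only, not OpenAI's or the benchmark's actual harness; the evaluate_patch helper, the local repository checkout, and the pytest-based test IDs are all assumptions made for the example.

```python
import subprocess
from pathlib import Path

def evaluate_patch(repo_dir: Path, patch_file: Path, fail_to_pass_tests: list[str]) -> bool:
    """Apply a model-generated patch, then check whether the issue's
    previously failing tests now pass (simplified pass/fail scoring)."""
    # Apply the candidate patch to a clean checkout of the repository.
    apply = subprocess.run(
        ["git", "apply", str(patch_file)],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if apply.returncode != 0:
        return False  # Patch does not even apply cleanly.

    # Re-run the tests associated with the GitHub issue.
    result = subprocess.run(
        ["python", "-m", "pytest", *fail_to_pass_tests],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0  # Resolved only if all targeted tests pass.

# Example usage (paths and test IDs are placeholders):
# resolved = evaluate_patch(Path("repo"), Path("model.patch"),
#                           ["tests/test_bugfix.py::test_issue_regression"])
```

The binary pass/fail nature of this kind of check is part of why such benchmarks saturate: once models resolve most instances, the score stops distinguishing between them.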

By moving away from this benchmark, OpenAI signals a need for new testing frameworks that can stress-test the limits of high-performing models. The company did not specify which alternative benchmarks it will prioritize as a replacement.

This transition highlights the volatile nature of AI evaluation. The industry relies on these benchmarks to compare models across different labs, but the utility of any single test diminishes as the models it measures become more sophisticated.

"We are no longer evaluating SWE-bench Verified."

The discontinuation of SWE-bench Verified suggests that frontier AI models are outgrowing current industry-standard benchmarks. This creates a 'measurement gap' where developers may lack objective ways to quantify improvements in coding logic and software engineering, potentially leading to a shift toward more complex, proprietary, or human-centric evaluation methods.