OpenAI Abandons SWE-bench Verified After Finding 59% of Failed Tests Were Flawed

Coininsight by Coininsight
March 3, 2026
in Blockchain

Rebeca Moen
Mar 03, 2026 18:33

OpenAI reveals major contamination issues in the SWE-bench Verified benchmark, showing that frontier AI models memorized solutions and that tests rejected correct code.




OpenAI has stopped reporting scores on SWE-bench Verified, the widely used AI coding benchmark, after finding that nearly 60% of the problems its models failed contained fundamentally broken tests. The company's February 23, 2026 analysis also found evidence that all major frontier models, including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash, had been trained on benchmark solutions, rendering scores meaningless.

“Improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities,” OpenAI stated. “Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.”

The Numbers Tell the Story

OpenAI audited 138 problems (27.6% of the 500-problem dataset) that its o3 model couldn't consistently solve across 64 independent runs. The findings were damning: 59.4% of these problems had material issues in test design or problem descriptions that made them “extremely difficult or impossible even for the most capable model or human to solve.”

Breaking down the failures: 35.5% of audited tasks had overly strict tests that rejected functionally correct solutions by demanding specific implementation details never mentioned in problem descriptions. Another 18.8% tested for functionality that wasn’t even specified in the task.
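To put those percentages in absolute terms, the short sketch below converts them into problem counts. The article reports only percentages, so the counts are inferred by rounding and should be treated as an assumption, not as figures published by OpenAI.

```python
# Figures taken from the article.
DATASET_SIZE = 500   # problems in SWE-bench Verified
AUDITED = 138        # problems o3 could not consistently solve

# Reported shares of the audited problems, in percent.
reported_shares = {
    "material issues": 59.4,
    "overly strict tests": 35.5,
    "tests for unspecified functionality": 18.8,
}


def implied_count(percent: float, total: int) -> int:
    """Nearest whole number of problems matching a reported percentage.

    This is an inference from rounded figures, not a published count.
    """
    return round(percent / 100 * total)


print(f"audited share of dataset: {AUDITED / DATASET_SIZE:.1%}")
for label, percent in reported_shares.items():
    count = implied_count(percent, AUDITED)
    print(f"{label}: about {count} of {AUDITED} problems")
```

Under that rounding assumption, the implied counts are 82, 49, and 26 problems respectively, and each rounds back to the reported percentage when divided by 138.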

One example involved a pylint PR where tests required importing a function called “get_annotation”, a name never mentioned in the problem statement. Models that solved the underlying issue correctly still failed because they didn’t psychically guess the expected function name.
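The failure mode can be illustrated with a toy example. Everything below is hypothetical except the name get_annotation, which comes from the article: the `describe` entry point, the alternative helper name, and both test functions are invented for illustration and do not reproduce pylint's code or the benchmark's actual tests.

```python
from types import SimpleNamespace


def make_reference_solution():
    """Hypothetical gold patch: its helper happens to be named get_annotation."""
    def get_annotation(node):
        return node.get("annotation")

    def describe(node):
        return f"{node['name']}: {get_annotation(node)}"

    return SimpleNamespace(get_annotation=get_annotation, describe=describe)


def make_alternative_solution():
    """Hypothetical model patch: same behaviour, different helper name."""
    def read_annotation(node):
        return node.get("annotation")

    def describe(node):
        return f"{node['name']}: {read_annotation(node)}"

    return SimpleNamespace(read_annotation=read_annotation, describe=describe)


def strict_test(solution) -> bool:
    """Overly strict: depends on a helper name the task never specified."""
    helper = getattr(solution, "get_annotation", None)
    if helper is None:
        return False
    return helper({"name": "x", "annotation": "int"}) == "int"


def behavioural_test(solution) -> bool:
    """Checks only observable behaviour through the public entry point."""
    return solution.describe({"name": "x", "annotation": "int"}) == "x: int"


for label, factory in [("reference", make_reference_solution),
                       ("alternative", make_alternative_solution)]:
    solution = factory()
    print(label, strict_test(solution), behavioural_test(solution))
```

In this sketch both solutions pass the behavioural test, but only the one that happens to use the unstated helper name passes the strict test.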

Every Major Model Is Contaminated

The contamination evidence proved more troubling. OpenAI built an automated red-teaming system using GPT-5 to probe competing models for benchmark knowledge. The results showed that all tested frontier models could reproduce original human-written solutions or quote verbatim problem details they should never have seen.

GPT-5.2, when given minimal hints, reproduced the exact code patch for a Django authentication fix, including the specific conditional statement “if username is None or password is None.” Claude Opus 4.5 quoted word-for-word an inline comment from a gold patch it supposedly never encountered. Gemini 3 Flash, given only a task ID, output the entire unified diff with correct line numbers.
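One simple way to flag this kind of verbatim reproduction is to look for long character spans shared between a model's output and the gold patch. The sketch below uses Python's standard `difflib` for that. It is not OpenAI's red-teaming system, whose details the article does not describe; the function names, the 30-character threshold, and the sample strings (apart from the quoted conditional) are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def longest_shared_span(model_output: str, gold_patch: str) -> str:
    """Return the longest contiguous substring common to both texts."""
    matcher = SequenceMatcher(None, model_output, gold_patch, autojunk=False)
    match = matcher.find_longest_match(0, len(model_output), 0, len(gold_patch))
    return model_output[match.a:match.a + match.size]


def looks_memorized(model_output: str, gold_patch: str, min_chars: int = 30) -> bool:
    """Heuristic flag: a long verbatim overlap suggests prior exposure.

    The threshold is an arbitrary assumption; short overlaps such as
    keywords or indentation are expected even without contamination.
    """
    span = longest_shared_span(model_output, gold_patch).strip()
    return len(span) >= min_chars


# Hypothetical gold patch fragment built around the conditional quoted above.
gold = "    if username is None or password is None:\n        return\n"
# Hypothetical model outputs.
suspicious = "I would add:\nif username is None or password is None:\n    return None"
unrelated = "def authenticate(): pass"

print(repr(longest_shared_span(suspicious, gold)))
print(looks_memorized(suspicious, gold), looks_memorized(unrelated, gold))
```

A check like this only catches literal overlap. It would miss paraphrased recall, and a long match on its own is not proof of contamination when the fix is short and idiomatic.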

The contamination creates an unfair advantage. Models that have seen solutions during training can pass underspecified tests by “remembering” implementation details that weren’t in the problem description, essentially having the answer key before the exam.

From 80% to 23%

The benchmark’s decay became visible in stalled progress. State-of-the-art scores improved only from 74.9% to 80.9% over six months, not because models hit capability ceilings, but because the remaining problems were either impossible or required memorized knowledge.

SWE-bench Pro, the recommended replacement, paints a different picture. According to recent data from February 26, 2026, models scoring 80% on Verified dropped to roughly 23% on Pro, a benchmark designed to resist contamination. Claude Opus 4.6 currently leads Pro with 79.20% performance, though that figure measures a different, cleaner test set.

What Comes Next

OpenAI recommends the industry shift to SWE-bench Pro’s public split while acknowledging it is imperfect. The company is investing in privately authored benchmarks like GDPVal, where domain experts create original tasks and trained reviewers grade solutions holistically.

The broader lesson matters for anyone tracking AI capabilities: benchmarks sourced from public repositories carry inherent contamination risk. When training data includes the test, scores become theater. For researchers, investors, and developers betting on AI coding progress, the real frontier is harder to measure than leaderboards suggest.

Image source: Shutterstock

