Anthropic just dropped a benchmarking study that should make the scientific community sit up and pay attention. The company's discovery team developed BioMysteryBench, a bioinformatics evaluation containing 99 real-world tasks drawn from DNA/RNA sequencing data, proteomics, and metabolomics. The results? Claude Opus 4 solved 94% of human-solvable problems reliably—and more interestingly, cracked roughly 30% of questions that stumped panels of five domain experts entirely.
Why Scientific Benchmarks Are Hard to Build
The research team spent considerable effort explaining why scientific capability evals are notoriously difficult to construct. Biology in particular presents three compounding challenges: there are often multiple valid approaches to any given problem (unlike math with a single correct answer), individual research decisions introduce subjectivity that can swing conclusions in noisy datasets, and the most impactful questions—where AI could have the greatest utility—are precisely those humans haven't solved yet. Traditional benchmarks like MMLU-Pro test knowledge recall while agentic evals like SciGym use simulated environments where ground truth is known. BioMysteryBench takes a different tack: questions are derived from controllable properties of real biological data, not human conclusions. For instance, "What organism does this crystal structure belong to?" has an objective answer, and viral species identification can be validated against PCR assay metadata.
Claude's Reliability Varies Wildly With Task Difficulty
The most striking finding isn't raw accuracy; it's reliability. When Anthropic ran each model five times per problem, a clear bimodal pattern emerged on human-solvable tasks: Opus 4 solved a problem either on nearly every attempt or not at all, with 86% of its solves succeeding on at least four of five attempts. But flip to the human-difficult set and that share drops to just 44%, while 44% of its solves succeeded only once in five attempts, which looks more like a lucky reasoning path than a reproducible method. This "brittleness" on hard problems is the more interesting story, according to Anthropic's own analysis. Sonnet 4 shows it even more sharply: its share of reliable solves collapses from 75% on easy tasks to just 22% on difficult ones, while its share of lucky wins jumps from 9% to 56%. The headline accuracy drop (77.4% down to 23.5%) understates what's actually happening: models know what they know on solvable problems, but on hard ones they stumble onto solutions rather than derive them reliably.
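To make the reliability metric concrete, here is a minimal Python sketch of how such a profile could be computed from five runs per problem. It is not Anthropic's evaluation code: the function name `reliability_profile`, the "reliable" threshold of four of five runs, the "lucky" definition of exactly one correct run, and the sample counts are all assumptions made for illustration.

```python
def reliability_profile(correct_runs, runs=5, reliable_min=4):
    """Summarise how repeatable a model's solves are.

    correct_runs: one integer per problem, the number of runs (0..runs)
                  in which the model answered that problem correctly.
    Assumed definitions (not taken from the benchmark itself):
      reliable = solved in at least `reliable_min` of `runs` attempts
      lucky    = solved in exactly one attempt
    Both shares are measured among problems solved at least once.
    """
    if any(not 0 <= c <= runs for c in correct_runs):
        raise ValueError("each count must be between 0 and `runs`")

    solved = [c for c in correct_runs if c > 0]
    total_runs = runs * len(correct_runs)
    profile = {
        "problems": len(correct_runs),
        "solved_at_least_once": len(solved),
        # Fraction of all individual runs that were correct.
        "run_accuracy": sum(correct_runs) / total_runs if total_runs else 0.0,
        "reliable_share": 0.0,
        "lucky_share": 0.0,
    }
    if solved:
        profile["reliable_share"] = sum(c >= reliable_min for c in solved) / len(solved)
        profile["lucky_share"] = sum(c == 1 for c in solved) / len(solved)
    return profile


if __name__ == "__main__":
    # Hypothetical per-problem counts, invented purely for illustration.
    easy_set = [5, 5, 5, 4, 5, 0, 5, 5, 1, 5]
    hard_set = [1, 0, 0, 2, 0, 5, 1, 0, 0, 1]
    print(reliability_profile(easy_set))
    print(reliability_profile(hard_set))
```

Reporting the reliable and lucky shares next to accuracy is the point of the exercise: two result sets can differ far more in how repeatable their solves are than a single accuracy figure suggests.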
Two Strategies That Set Claude Apart
Anthropic identified two approaches Claude uses that humans don't. The first is a "know-it-all" capability: Opus leverages its training on hundreds of thousands of papers to perform meta-analyses and database stitching directly from internal knowledge, solving tasks that would take human experts weeks of literature review. The second applies when Claude is uncertain: it layers multiple independent methods and converges on the answer where the different approaches agree. This contrasts with the human benchmarkers, who tended toward single-strategy solutions. In one particularly telling example, humans used algorithms or databases to annotate dataset properties while Claude intuitively recognized sequence patterns, echoing how the TATA box promoter was originally discovered by a scientist noticing repetition in upstream gene sequences.
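The second strategy, converging where independent methods agree, can be sketched as a simple vote. This is only an illustration of the idea, not a description of how Claude works internally: the function `consensus_answer`, the `min_agreement` parameter, and the method names in the usage example are hypothetical.

```python
from collections import Counter


def consensus_answer(method_answers, min_agreement=2):
    """Return (answer, supporting_methods) when independent methods agree.

    method_answers: dict mapping a method name to the answer it produced,
                    or None if that method failed to produce one.
    An answer is accepted only if at least `min_agreement` methods gave it
    and no other answer has the same number of votes. Otherwise the result
    is (None, []), signalling that no consensus was reached.
    """
    votes = Counter(a for a in method_answers.values() if a is not None)
    if not votes:
        return None, []

    ranked = votes.most_common()
    top_answer, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if top_count < min_agreement or tied:
        return None, []

    supporters = sorted(m for m, a in method_answers.items() if a == top_answer)
    return top_answer, supporters


if __name__ == "__main__":
    # Hypothetical methods and answers for a species-identification task.
    answers = {
        "kmer_profile": "Escherichia coli",
        "marker_gene_search": "Escherichia coli",
        "gc_content_heuristic": "Salmonella enterica",
    }
    print(consensus_answer(answers))
```

Refusing to answer on a tie or on a single unsupported vote is a deliberate choice in this sketch: agreement between independent methods is what provides the confidence, so a lone result is treated as unconfirmed.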
CompBioBench Echoes Results Independently
Anthropic notes that Genentech and Roche released their own computational biology benchmark, CompBioBench, around the same time with strikingly similar findings. Their 100-task eval showed Claude Opus 4 reaching 81% overall and 69% on the hardest questions, reinforcing the view that frontier models have crossed a threshold where they're genuinely useful scientific collaborators rather than mere chat assistants.
Key Takeaways
- BioMysteryBench contains 99 real-world bioinformatics tasks with objective, verifiable answers derived from data properties rather than human conclusions
- Claude Opus 4 reliably solved 94% of human-solvable problems but cracked only about 30% of the questions that stumped the human expert panels
- Reliability analysis reveals that roughly half of "wins" on difficult problems are lucky reasoning paths rather than reproducible methods—a key metric beyond raw accuracy
- Independent CompBioBench from Genentech/Roche reached similar conclusions, suggesting these findings aren't artifacts of one benchmark's design
The Bottom Line
The gap between what AI can retrieve reliably and what it can derive creatively is the real frontier here, and that's where the interesting work lies. Anthropic's own models identified this pattern in their data before humans pointed it out, which suggests we're watching something genuinely new emerge: not just a powerful lookup engine, but a system developing something that looks uncomfortably like scientific taste.