An Independent Check Caught a Claim We'd Overstated
Earlier this month we had a claim live on our site about where our ESM-2 variant scoring stops being useful, and a check inside our own pipeline showed it was overstated. We corrected it. This is a short note on what happened, because the mechanism matters more than the correction.
What was on our pages — and what we’d concluded internally
We score variants with ESM-2, a protein language model, and we publish where it works and where it doesn’t. On the question of disordered regions, two of our own statements pointed in opposite directions. On two live pages, a note overstated the evidence: it called intrinsic disorder a documented boundary condition for ESM-2, flatly, as established fact. Separately, our internal research had reached what looked like the opposite conclusion: that no published study had directly tested ESM-2 zero-shot scoring inside disordered regions, meaning checked against experimental mutation data (deep mutational scanning), at residue resolution, across multiple proteins. (That “against experimental mutation data” qualifier is load-bearing: clinical-label studies do touch disordered regions, so the internal claim was about the harder, direct test.) So a public page called the effect documented while an internal note said the direct evidence did not exist. Both hinged on a single paper neither had found.
What the independent check found
Every claim we publish goes through our independent verifier, Veritas, whose job is to try to break a claim, not confirm it, before it is final. Running its own literature search — built differently from ours — it surfaced the paper both of our assessments had missed: Sharma & Gitter 2025 (arXiv:2504.16886), which ran exactly that direct test — ESM-2 650M scored against deep-mutational-scanning data, ordered versus disordered regions, across 36 proteins. One paper resolved both errors at once: it exists, so the internal “no study tested this” was wrong — and it is a single preprint, so the live “documented boundary condition” was far too strong.
Why our own search missed it
This is the part worth keeping. Our four searches all shared one framing: “ESM is blind in disordered regions.” The paper is framed around structure-based fitness prediction, with disorder as a secondary result, so it was invisible to searches built on our framing. A search that reuses the original framing reproduces the original blind spot. That is why re-running the same search does not catch a false negative; only an independently constructed pass does. Precision (is a result we found real?) and recall (did we find the results that exist?) fail differently, and recall fails silently.
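The precision/recall asymmetry can be made concrete with a toy sketch. Everything here is hypothetical: the paper ids, the keyword sets, and the `search` helper are illustrative stand-ins, not our actual search tooling or the real literature. The point is only that a re-run with the same framing returns the same set, while a differently framed query can surface what the first one could not.

```python
# Toy illustration only: hypothetical paper ids and keywords, not real data.

def precision(found: set, relevant: set) -> float:
    """Of the results we found, what fraction are actually relevant?"""
    return len(found & relevant) / len(found) if found else 1.0

def recall(found: set, relevant: set) -> float:
    """Of the relevant results that exist, what fraction did we find?"""
    return len(found & relevant) / len(relevant) if relevant else 1.0

# Each hypothetical paper is reduced to the terms it is framed around.
corpus = {
    "paper_a": {"esm", "disordered", "blind"},
    "paper_b": {"esm", "disordered", "clinical"},
    # Relevant, but framed around structure; disorder is a secondary result.
    "paper_c": {"esm", "structure", "fitness", "prediction"},
}

# In practice this set is unknown, which is why low recall is silent.
relevant = {"paper_a", "paper_b", "paper_c"}

def search(query_terms: set) -> set:
    """Return ids of papers whose framing contains every query term."""
    return {pid for pid, terms in corpus.items() if query_terms <= terms}

original = search({"esm", "disordered"})        # the original framing
rerun = search({"esm", "disordered"})           # same framing, same blind spot
independent = search({"structure", "fitness"})  # independently built query

print(precision(original, relevant), recall(original, relevant))
print(recall(original | independent, relevant))
```

In this toy setup the original query looks perfect on precision, because every hit is real, and nothing in its output hints at the missing paper. Only the union with the independently framed query raises recall.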
What we changed
We corrected the two live pages from “documented boundary condition” down to the bounded version — as of 2026-06-13, one preprint provides direct evidence, not an established result — and reframed the internal assessment to that same bounded statement, re-verified at exactly that scope. One consistent, bounded claim everywhere, matching what the single source actually supports.
Why this is the point
We route claims through a separate verifier instead of self-certifying for one structural reason: a producer cannot see its own blind spot, by definition. Here a single independent check did two things we could not do for ourselves. It overturned an internal conclusion and walked back an over-claim already live on our pages, in opposite directions, from one source that both of our assessments had missed. If you are weighing whether to lean on a computational score near a clinical decision, the useful question is not whether a vendor ever gets something wrong (everyone does) but whether there is a mechanism that catches it before you do. This is ours, working on our own claims.
A working note in our Verification in practice series. Companion: The Feature We Didn’t Ship.