The Feature We Didn't Ship
We almost built a confidence feature for our variant scorer. Before we wrote a line of production code, we wrote down the test it had to pass and had it witnessed and locked, with no way to move the goalposts afterward. The candidate signal failed that test, so we did not build the feature. This note is about that decision, because the discipline is the whole point.
The feature we wanted
A per-position confidence view: for each position in a protein, signal whether ESM-2’s score there is reliable or essentially a guess. That is a genuinely useful thing to want — a score you cannot calibrate is a score you cannot act on. The candidate signal was per-position entropy: how spread-out the model’s prediction is at each position.
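To make "how spread-out the prediction is" concrete, here is a minimal sketch of per-position Shannon entropy. It assumes each position has already been reduced to a probability distribution over the 20 standard amino acids; how those probabilities are extracted from ESM-2 is not shown, and the function names are hypothetical rather than taken from our codebase.

```python
import math


def position_entropy(probs):
    """Shannon entropy in bits of one position's predicted distribution."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total  # renormalise defensively against rounding drift
        if q > 0:      # a zero-probability residue contributes nothing
            entropy -= q * math.log2(q)
    return entropy


def per_position_entropy(prob_rows):
    """One entropy value per sequence position."""
    return [position_entropy(row) for row in prob_rows]


# Illustrative distributions only, not model output.
uniform = [1 / 20] * 20           # maximally spread-out prediction
peaked = [0.962] + [0.002] * 19   # nearly all mass on one residue
entropies = per_position_entropy([uniform, peaked])
```

A uniform prediction reaches the maximum of log2(20) bits and a one-hot prediction gives zero, so the candidate signal amounts to treating high values as "probably a guess".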
The bar, written down first
Before running anything, we set the exact pass/fail criteria the signal had to clear, and had our independent verifier witness and lock them — with no relaxation clause. That last part is the load-bearing move: once you have seen the results, you can always find a threshold that makes a borderline result look like a win. Pre-registration removes that temptation by construction.
The bar required the signal to predict per-position reliability across at least six proteins, and it included a deliberate falsifier: an ordered protein, on which high entropy still had to predict unreliability. If it did not, the falsifier fired, meaning the signal was tracking something other than reliability.
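One way to make "written down and locked" mechanical is to hash the criteria before any results exist and hand the digest to the witness. The sketch below is an illustration, not a description of our verifier's actual process: the class, field names, and the SHA-256 commitment scheme are all assumptions, and the numeric values are the ones reported elsewhere in this note.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PreRegisteredBar:
    # Hypothetical field names, chosen for illustration.
    min_proteins_tested: int
    min_proteins_calibrated: int
    falsifier_protein: str
    relaxation_allowed: bool = False


def lock(bar):
    """Return a SHA-256 digest of the criteria in a canonical JSON form."""
    canonical = json.dumps(asdict(bar), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(bar, witnessed_digest):
    """True only if the criteria still match what the witness recorded."""
    return lock(bar) == witnessed_digest


bar = PreRegisteredBar(
    min_proteins_tested=6,
    min_proteins_calibrated=4,
    falsifier_protein="C6KNH7",
)
digest = lock(bar)  # handed to the independent verifier before any run
```

Any later edit to a threshold produces a different digest, so a relaxed bar cannot pass as the original one.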
It failed
The falsifier fired. On an ordered protein (influenza hemagglutinin, UniProt C6KNH7), high entropy predicted better reliability (correlation +0.077) — the exact wrong direction. Calibration held in only one of six proteins, where we had required at least four. Two looser checks did pass, but we had pre-committed not to use them to rescue the failed ones, and we did not.
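The decision logic can be sketched in a few lines. This is a simplified reconstruction under stated assumptions: the function and field names are hypothetical, and the rule that the falsifier fires on any non-negative entropy-reliability correlation is an assumption made for illustration, since the exact pre-registered threshold is not reproduced in this note. The inputs are the figures reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    calibration_ok: bool
    falsifier_fired: bool
    ship: bool


def evaluate_gate(calibrated, required, falsifier_corr, looser_checks_passed=0):
    """Apply the locked bar; looser checks are recorded but never consulted."""
    calibration_ok = calibrated >= required
    # Assumed rule: on the ordered protein, entropy must correlate negatively
    # with reliability. Zero or positive means the falsifier fires.
    falsifier_fired = falsifier_corr >= 0.0
    # looser_checks_passed is deliberately absent from this expression:
    # passing side checks cannot rescue a failed primary criterion.
    ship = calibration_ok and not falsifier_fired
    return GateResult(calibration_ok, falsifier_fired, ship)


result = evaluate_gate(
    calibrated=1,            # proteins where calibration held
    required=4,              # pre-registered minimum
    falsifier_corr=0.077,    # ordered protein, wrong direction
    looser_checks_passed=2,  # passed, but not allowed to count
)
```

Under these assumptions `result.ship` is `False` on both criteria independently, which matches the outcome described above.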
The read: this entropy signal tracks intrinsic disorder, not reliability. Shipped as a general confidence view, it would have painted “low confidence” onto ordered positions that ESM-2 actually scores well — misleading exactly where a user would trust it. To be precise about scope: this is a result about this specific instrument, not a verdict on ESM-2 generally.
Why we’re telling you about a feature that doesn’t exist
The gate working is the product. We would rather kill a plausible feature than ship one that misleads in the regime users care about — and a pre-registered, witnessed bar is what makes “we decided not to ship” a discipline rather than a mood. For anyone weighing a vendor’s tools near a clinical decision, the failure mode to fear is not a missing confidence signal; it is a confidence signal that quietly misleads. We built a test for exactly that, our own instrument failed it, and we stopped. There is no confidence track. That is the honest outcome.
A working note in our Verification in practice series. Companion: An Independent Check Caught a Claim We’d Overstated.