Verification Report: TP53 VUS / ESM-2 Domain-Stratified Benchmark
This report documents the independent verification of the article “Ranking TP53 VUS with ESM-2: Strong in the DNA-Binding Domain, Blind in the Disordered Regions.”
Validation Methodology
This is a single-candidate, single-protein report: full-length TP53 (UniProt P04637, 393 amino acids), evaluated against ESM-2 650M and cross-referenced to two published deep mutational scanning (DMS) datasets. The whole protein is scored as one sequence, and results are reported both globally and stratified by structural domain — because TP53’s signal is concentrated in its folded core and absent in its disordered regions, a single whole-protein number understates the tool.
Primary DMS source: Giacomelli et al. 2018, Nature Genetics 50(10):1381–1387 (doi:10.1038/s41588-018-0204-y). ~7,400 missense variants across the full protein (residues 1–393), generated by Mutagenesis by Integrated TilEs (MITE) saturation mutagenesis with lentiviral delivery, under three selection conditions: p53-null + nutlin-3 (Null-Nutlin), wild-type-p53 + nutlin-3 (WT-Nutlin), and p53-null + etoposide (Null-Etoposide). Data accessed via ProteinGym v1.3 (Notin et al. 2023, NeurIPS).
Primary DMS source: Kotler et al. 2018, Molecular Cell 71(1):178–190 (doi:10.1016/j.molcel.2018.06.012). ProteinGym v1.3 distributes a 1,048-variant single-missense DNA-binding-domain subset of this assay; the full paper assayed ~10,000 DBD variants. All Kotler results here are reported against the 1,048-variant ProteinGym subset only.
ESM-2 scoring. 650M-parameter masked-marginal scoring on the NeuroAutomata platform, full-length P04637 (393 amino acids, single sequence), on Modal serverless T4 hardware (FP16), scored as the log-likelihood ratio of each variant against the wild-type residue.
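The log-likelihood-ratio scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's implementation: `masked_marginal_llr`, `parse_variant`, and the toy probability table are hypothetical names and made-up numbers. In a real run the per-position distribution would come from ESM-2 with the variant position masked.

```python
import math


def masked_marginal_llr(position_probs, wt_aa, mut_aa):
    """Log-likelihood ratio of a mutant residue against the wild-type residue.

    position_probs maps amino-acid letter -> model probability at the masked
    position (hypothetical input; a real run would read these from the
    language model's softmax output with that position masked).
    """
    return math.log(position_probs[mut_aa]) - math.log(position_probs[wt_aa])


def parse_variant(variant):
    """Split a variant such as 'R248W' into (wild type, 1-based position, mutant)."""
    return variant[0], int(variant[1:-1]), variant[-1]


# Toy distribution for one masked position (made-up numbers, not ESM-2 output).
toy_probs = {"R": 0.80, "K": 0.15, "W": 0.05}

wt, pos, mut = parse_variant("R248W")
score = masked_marginal_llr(toy_probs, wt, mut)
```

A negative score means the model assigns the mutant residue a lower probability than the wild-type residue at that position, which is the direction this report reads as predicted loss of function.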
Global and regional results (Giacomelli Null-Nutlin unless noted). Global Spearman rho = 0.41 (N = 7,464) — diluted by the protein’s intrinsically disordered regions. Stratified by domain:
- DNA-binding domain (residues 102–292): rho = 0.46 (N = 3,628, Giacomelli) and 0.68 (N = 1,048, Kotler). This is the strong-signal region — the folded core where the large majority of pathogenic germline TP53 variants sit, and where the model’s ordering agrees with the experiment.
- Tetramerization domain (residues 323–356): rho = 0.38 under the Null assays but −0.41 under WT-Nutlin (N = 646) — a weak, assay-dependent result whose sign is unstable across conditions, disclosed as a limitation rather than led with.
- Intrinsically disordered regions (TAD1 1–40, TAD2 41–61, proline-rich 62–97, C-terminal regulatory 357–393): rho 0 to −0.26, no usable signal. This fits the model’s applicability domain: reliability drops where evolutionary signal is thin, as it often is in disordered regions. The disorder-specific reading rests on a single preprint (Sharma & Gitter 2025, arXiv:2504.16886) as of 2026-06-13 and is not an established boundary condition.
The honest framing is a range, not a ratio: ESM-2 is strong in the DNA-binding domain (0.46–0.68) and has no usable signal in the disordered regions (0 to −0.26). We report the regional values directly and do not compute a DBD-versus-IDR multiplier.
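The domain stratification above amounts to grouping variants by residue position and computing a Spearman correlation per group. A minimal pure-Python sketch, assuming input records of the form `(position, model_score, dms_score)`; the function names and the "unassigned" label for residues outside the listed domains (98–101 and 293–322) are illustrative choices, not the report's own pipeline.

```python
from collections import defaultdict

# Domain boundaries as listed in this report (1-based, inclusive).
DOMAINS = [
    ("TAD1", 1, 40),
    ("TAD2", 41, 61),
    ("proline-rich", 62, 97),
    ("DNA-binding", 102, 292),
    ("tetramerization", 323, 356),
    ("C-terminal regulatory", 357, 393),
]


def domain_of(position):
    """Return the domain containing a residue, or 'unassigned' for linker residues."""
    for name, start, end in DOMAINS:
        if start <= position <= end:
            return name
    return "unassigned"


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (vx * vy) ** 0.5


def stratified_spearman(records):
    """records: iterable of (position, model_score, dms_score) tuples."""
    groups = defaultdict(list)
    for pos, model, dms in records:
        groups[domain_of(pos)].append((model, dms))
    return {
        name: spearman([m for m, _ in rows], [d for _, d in rows])
        for name, rows in groups.items()
        if len(rows) >= 3  # skip groups too small to rank meaningfully
    }
```

Reporting the per-domain dictionary rather than a single ratio mirrors the framing above: each region gets its own value, and no DBD-versus-IDR multiplier is derived.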
Hotspots and worked examples. Across the eleven canonical cancer hotspots, ESM-2 and the Giacomelli assay are concordant on the loss-of-function (fold-destruction) axis — every hotspot scores negative under ESM-2 and loss-of-function under the DMS, most deleterious at R248W (ESM-2 −12.47). This is loss-of-function detection, not gain-of-function prediction: several of the canonical hotspots (e.g., R175H, R273H) are gain-of-function neomorphs, and ESM-2 is blind to that oncogenic gain (see disclosed weaknesses). The hepatocellular-carcinoma hotspot R249S (ESM-2 −6.34, loss-of-function in all four assay readouts) leads the germline worked example. The Li-Fraumeni founder variant R337H is the honest counter-case: ESM-2 scores it mild (−2.56), and so do all three standard 37 °C Giacomelli assays — a shared model-and-assay blind spot for the temperature-conditional defect, not a model-specific under-call.
Baseline. The published ESM-2 650M zero-shot baseline across the full ProteinGym v1.3 suite is Spearman rho 0.414 (217 assays; rank 45 of 97 models), read from the public ProteinGym leaderboard. Every TP53 regional value above is reported against that baseline.
How to read the Spearman correlation. Spearman rho measures agreement on the ordering of variant effects, not per-variant accuracy. As a rough guide, a DBD rho of 0.46 maps to a pairwise concordance of (1 + 0.46)/2 ≈ 0.73: for a randomly chosen pair of variants within the domain, ESM-2 and the experiment order them the same way about 73% of the time, a little under three in four. The (1 + x)/2 identity is exact for Kendall’s tau rather than for Spearman’s rho, so the 73% figure is an approximation. For clinical use, this is a prioritization signal, not a classification: ESM-2 helps rank which DNA-binding-domain VUS to send for functional testing first; it does not classify them. The published article makes no ACMG/AMP variant classifications; every VUS-related claim is descriptive (rank prioritization).
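The (1 + rho)/2 reading can be compared with a direct count of concordant variant pairs. The mapping is exact for Kendall's tau (without ties) and only approximate for Spearman's rho, so a sketch like the one below could be used to measure pairwise concordance directly from paired scores. The function names are hypothetical.

```python
from itertools import combinations


def pairwise_concordance(x, y):
    """Fraction of variant pairs ordered the same way by both score lists.

    Pairs tied in either list are skipped.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total = concordant + discordant
    return concordant / total if total else float("nan")


def kendall_tau_untied(x, y):
    """Kendall's tau computed over untied pairs: 2 * concordance - 1."""
    return 2 * pairwise_concordance(x, y) - 1


def heuristic_concordance(rho):
    """The (1 + rho) / 2 mapping used in this report."""
    return (1 + rho) / 2


# Toy ranking with one swapped pair out of six (made-up numbers).
toy_model = [1, 2, 3, 4]
toy_dms = [1, 3, 2, 4]
toy_fraction = pairwise_concordance(toy_model, toy_dms)  # 5 of 6 pairs agree
```

Running `pairwise_concordance` on the actual ESM-2 and DMS scores for the domain would give the measured fraction, which can then be set beside `heuristic_concordance(0.46)`.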
Summary
| Metric | Value |
|---|---|
| Claims | 15 total (CLM-2026-3401 … CLM-2026-3415) |
| Result | All checked; the page passed verification |
| Per-claim outcome | 15 of 15 verified (0 contradicted, 0 unverifiable) — shown claim-by-claim in the table below |
| Numbers checked for arithmetic consistency | Every quantitative claim (Spearman rho, sample size N, and categorical-agreement value) checked for internal consistency — rho values within the valid −1 to 1 range, sample-size accounting, and the domain-stratified accounting against the global value |
| Categorical classifications made | 0 (none authored — see “What Was Checked”) |
| Source data | TP53 full-length validation dataset (versions recorded below at build time) |
| Verified | 2026-06-04 (verdict-of-record 6cb0024) |
| Verified by | Veritas |
What Was Checked
Source-against-claim review (all 15 claims). Every claim was checked end-to-end against the primary source it cites — whether the source actually reports what we claim, and whether the scope matches. The stricter framing-fidelity check — whether the prose framing is faithful to what the cited authors reported — ran additionally on the verified subset.
Cited primary sources (independently resolved and checked figure-by-figure): Giacomelli et al. 2018 Nature Genetics (the three-condition saturation DMS); Kotler et al. 2018 Molecular Cell (the DNA-binding-domain proliferation assay); and the ProteinGym v1.3 substitution DMS leaderboard (the published 0.414 full-suite baseline). The benchmark claims are ESM-2’s own measurements compared against these source DMS datasets.
Context sources (referenced in the article for background, outside the formal benchmark-claim scope): UniProt P04637 (TP53 sequence reference), and the germline / clinical-context citations the article cites for framing — Guha & Malkin 2017 (DNA-binding-domain clustering of germline variants), Garritano et al. 2010 (the R337H Brazilian founder allele), and Federici & Soddu 2020 (the variant-of-uncertain-significance clinical-context figure). These supply background, not benchmark numbers.
Arithmetic-consistency check. Every Spearman rho, sample size, and categorical-agreement value was checked for internal consistency: rho values within the valid −1 to 1 range across the global result (0.41), the DNA-binding domain (0.46 Giacomelli, 0.68 Kotler), the tetramerization domain (0.38 / −0.41), and the disordered regions (0 to −0.26); the sample-size accounting (global N = 7,464 against the domain subsets); and the hotspot loss-of-function concordance count (11 of 11).
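A check of this kind can be sketched in a few lines. The values below are transcribed from this report; the check logic is an assumption about what such a check might contain, not the verifier's actual procedure. In particular, the bound of 19 alternative amino acids per residue is an illustrative ceiling on single-missense counts that the report does not itself state.

```python
# Values transcribed from this report.
RHO_VALUES = {
    "global (Giacomelli Null-Nutlin)": 0.41,
    "DBD (Giacomelli)": 0.46,
    "DBD (Kotler)": 0.68,
    "tetramerization (Null assays)": 0.38,
    "tetramerization (WT-Nutlin)": -0.41,
    "IDR lower bound": -0.26,
}

# (first residue, last residue, reported N) for the Giacomelli-based results.
REGIONS = {
    "full length": (1, 393, 7464),
    "DBD": (102, 292, 3628),
    "tetramerization": (323, 356, 646),
}

HOTSPOTS_CONCORDANT, HOTSPOTS_TOTAL = 11, 11


def max_missense(start, end):
    """Assumed ceiling: 19 alternative amino acids per residue in the span."""
    return 19 * (end - start + 1)


def check_consistency():
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    for label, rho in RHO_VALUES.items():
        if not -1.0 <= rho <= 1.0:
            problems.append(f"{label}: rho {rho} outside [-1, 1]")
    for label, (start, end, n) in REGIONS.items():
        if n > max_missense(start, end):
            problems.append(f"{label}: N={n} exceeds {max_missense(start, end)}")
    global_n = REGIONS["full length"][2]
    # DBD and tetramerization do not overlap, so their counts must fit in the global N.
    subset_n = sum(n for label, (_, _, n) in REGIONS.items() if label != "full length")
    if subset_n > global_n:
        problems.append(f"domain subsets ({subset_n}) exceed global N ({global_n})")
    if not 0 <= HOTSPOTS_CONCORDANT <= HOTSPOTS_TOTAL:
        problems.append("hotspot concordance count out of range")
    return problems
```

Under these assumptions the reported figures are internally consistent: the tetramerization N of 646 equals 19 × 34 residues, and the DBD N of 3,628 sits one below 19 × 191.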
Single-source measurement grain. Each regional Spearman value is ESM-2’s measurement against a single primary DMS for that region — one DMS record or one leaderboard row — not a value corroborated across two independent datasets. All 15 claims are verified (each round-trips to its cited substrate under the numerical check); this note discloses the single-primary-source grain so that verified is read precisely — correctly grounded in the cited source, not independently cross-validated.
Categorical-classification check — 0 claims, none authored. Every TP53 variant claim is framed as a descriptive measurement — rank prioritization, regional correlation, loss-of-function concordance — not as an ACMG/AMP variant classification. The article states it plainly: ESM-2 helps prioritize which VUS to test first; it does not classify them. The underlying record matches that framing throughout.
How This Page Works
This is a single-protein report. The full-length TP53 substrate carries one evidence trail — one protein, one ESM-2 scoring run, cross-referenced to the Giacomelli and Kotler DMS datasets — and the claim list sits at the level of the published article. Each claim records the source it is checked against, so the per-claim source is auditable; the claim-by-claim table below shows that mapping, with each claim’s verification state derived from the verdict overlay.
The exact source-data versions and verification-result version used at this page’s most recent build are shown in the footer below — an evidence chain back to a specific, reproducible state.
Verification independence
Verification is performed by Veritas, an independent agent with no involvement in producing the content it checks.
Claims Register
| Claim ID | Claim | Source | Strength | Status |
|---|---|---|---|---|
| CLM-2026-3401 | ~7,400 TP53 missense variants, MITE saturation mutagenesis, three selection conditions | Giacomelli et al. 2018, Nature Genetics 50(10):1381-1387 | strong | Verified |
| CLM-2026-3402 | Kotler 2018 ProteinGym v1.3 subset: 1,048 DBD single-missense variants (of the paper's ~10,000) | Kotler et al. 2018, Molecular Cell 71(1):178-190 | strong | Verified |
| CLM-2026-3403 | Full-length TP53 vs Giacomelli Null-Nutlin: N=7,464, Spearman rho = 0.41 | Axon Agentic ESM-2 TP53 full-length cross-reference (Null_Nutlin) | strong | Verified |
| CLM-2026-3404 | DBD (102-292): N=3,628, Spearman rho = 0.46 (STRONG); reproduces prior 0.463 | Axon Agentic ESM-2 TP53 DBD domain-stratified cross-reference | strong | Verified |
| CLM-2026-3405 | DBD via Kotler: N=1,048, Spearman rho = 0.68 (strongest TP53 correlation) | Axon Agentic ESM-2 TP53 Kotler cross-reference | strong | Verified |
| CLM-2026-3406 | Tetramerization (323-356): N=646, rho = 0.38 (Null assays), -0.41 (WT-Nutlin); WEAK/assay-dependent | Axon Agentic ESM-2 TP53 tetramerization domain-stratified cross-reference | moderate | Verified |
| CLM-2026-3407 | IDRs (TAD1/TAD2/proline-rich/regulatory): rho approx 0 to -0.26; WEAK, no signal | Axon Agentic ESM-2 TP53 IDR domain-stratified cross-reference | strong | Verified |
| CLM-2026-3408 | 11/11 cancer hotspots concordant on the LOF (fold-destruction) axis: ESM-2 negative AND Giacomelli LOF | Axon Agentic ESM-2 TP53 hotspot concordance (full-length substrate) | strong | Verified |
| CLM-2026-3409 | R249S (HCC): ESM-2 -6.34; DMS LOF in all 4 assays (WT-Nutlin 4th percentile); concordant | Axon Agentic ESM-2 TP53 R249S: of-record substrate (ESM-2 LLR) + ProteinGym v1.3 DMS (per-variant LOF) | strong | Verified |
| CLM-2026-3410 | R337H: ESM-2 mild (-2.56) AND all 3 Giacomelli assays WT-like (+0.21/+0.20/+0.96); SHARED blind-spot, not model under-call | Axon Agentic ESM-2 TP53 R337H: of-record substrate (ESM-2 LLR) + Giacomelli DMS (per-variant WT-like, all 3 assays) | strong | Verified |
| CLM-2026-3411 | ESM-2 is GOF-blind: hotspot negative scores are LOF (fold destruction) detection, NOT a gain-of-function prediction; disclosed weakness (Boundary 7) | Axon Agentic ESM-2 boundary catalog (Boundary 7, gain-of-function) + TP53 hotspot LOF concordance | strong | Verified |
| CLM-2026-3412 | 11/11 TP53 hotspots LOF in Giacomelli Null-Nutlin (per-variant DMS) | Giacomelli et al. 2018 Null-Nutlin per-variant DMS (ProteinGym v1.3), TP53 hotspot subset | strong | Verified |
| CLM-2026-3413 | 11/11 TP53 hotspots LOF in Giacomelli Null-Etoposide (per-variant DMS) | Giacomelli et al. 2018 Null-Etoposide per-variant DMS (ProteinGym v1.3), TP53 hotspot subset | strong | Verified |
| CLM-2026-3414 | 11/11 TP53 hotspots LOF in Giacomelli WT-Nutlin (per-variant DMS) | Giacomelli et al. 2018 WT-Nutlin per-variant DMS (ProteinGym v1.3), TP53 hotspot subset | strong | Verified |
| CLM-2026-3415 | ESM-2 650M ProteinGym full-suite baseline: Spearman rho = 0.414 (217 assays, rank 45/97) | ProteinGym v1.3 leaderboard (Notin et al. 2023), ESM-2 650M aggregate Spearman, public leaderboard | strong | Verified |
Substantiated against the TP53 validation claim list (version 6cb0024); the verification result on record is at version 10aa958 (both captured at build time; cross-checked source trail).