Verification Report

Verification Report: TP53 VUS / ESM-2 Domain-Stratified Benchmark

This report documents the independent verification of the article “Ranking TP53 VUS with ESM-2: Strong in the DNA-Binding Domain, Blind in the Disordered Regions.”


Validation Methodology

This is a single-candidate, single-protein report: full-length TP53 (UniProt P04637, 393 amino acids), evaluated against ESM-2 650M and cross-referenced to two published deep mutational scanning (DMS) datasets. The whole protein is scored as one sequence, and results are reported both globally and stratified by structural domain — because TP53’s signal is concentrated in its folded core and absent in its disordered regions, a single whole-protein number understates the tool.

Primary DMS source: Giacomelli et al. 2018, Nature Genetics 50(10):1381–1387 (doi:10.1038/s41588-018-0204-y). ~7,400 missense variants across the full protein (residues 1–393), generated by lentiviral saturation mutagenesis using Mutagenesis by Integrated TilEs (MITE), under three selection conditions: p53-null + nutlin-3 (Null-Nutlin), wild-type-p53 + nutlin-3 (WT-Nutlin), and p53-null + etoposide (Null-Etoposide). Data accessed via ProteinGym v1.3 (Notin et al. 2023, NeurIPS).

Primary DMS source: Kotler et al. 2018, Molecular Cell 71(1):178–190 (doi:10.1016/j.molcel.2018.06.012). ProteinGym v1.3 distributes a 1,048-variant single-missense DNA-binding-domain subset of this assay; the full paper assayed ~10,000 DBD variants. We report against the 1,048-variant ProteinGym subset and state that scope explicitly.

ESM-2 scoring. 650M-parameter masked-marginal scoring on the NeuroAutomata platform, full-length P04637 (393 amino acids, single sequence), on Modal serverless T4 hardware (FP16), scored as the log-likelihood ratio of each variant against the wild-type residue.
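The log-likelihood-ratio step can be illustrated with a minimal sketch. It assumes the model has already been run once per position with that position masked, and that the resulting logits are available as plain dictionaries; the function names, the toy sequence, and the toy logits are all hypothetical and are not ESM-2 output or the platform's actual code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a dict of {amino_acid: logit}."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {aa: v - log_z for aa, v in logits.items()}

def masked_marginal_llr(masked_logits, wt_seq, variant):
    """Score a variant such as 'R3W' as log p(mutant) - log p(wild type).

    masked_logits[i] holds the logits at position i (0-based) obtained when
    that position alone is masked. In a real run these come from the model.
    """
    wt, pos, mut = variant[0], int(variant[1:-1]), variant[-1]
    if wt_seq[pos - 1] != wt:
        raise ValueError(f"{variant}: wild-type residue is {wt_seq[pos - 1]}, not {wt}")
    log_p = log_softmax(masked_logits[pos - 1])
    return log_p[mut] - log_p[wt]

# Toy three-residue sequence with made-up logits (assumption, not real data).
toy_seq = "MER"
toy_logits = [
    {"M": 5.0, "E": 0.0, "R": 0.0, "W": 0.0},
    {"M": 0.0, "E": 3.0, "R": 1.0, "W": 0.0},
    {"M": 0.0, "E": 0.0, "R": 4.0, "W": -2.0},
]
score = masked_marginal_llr(toy_logits, toy_seq, "R3W")
print(round(score, 2))  # -6.0
```

A negative score means the model finds the substituted residue less likely than the wild type at that site, which is the direction the report reads as deleterious.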

Global and regional results (Giacomelli Null-Nutlin unless noted). Global Spearman rho = 0.41 (N = 7,464) — diluted by the protein’s intrinsically disordered regions. Stratified by domain:

  • DNA-binding domain (residues 102–292): rho = 0.46 (N = 3,628, Giacomelli) and 0.68 (N = 1,048, Kotler). This is the strong-signal region — the folded core where the large majority of pathogenic germline TP53 variants sit, and where the model’s ordering agrees with the experiment.
  • Tetramerization domain (residues 323–356): rho = 0.38 under the Null assays but −0.41 under WT-Nutlin (N = 646) — a weak, assay-dependent result whose sign is unstable across conditions, disclosed as a limitation rather than led with.
  • Intrinsically disordered regions (TAD1 1–40, TAD2 41–61, proline-rich 62–97, C-terminal regulatory 357–393): rho from 0 to −0.26, i.e. no usable signal. This fits the model’s applicability domain: reliability drops where evolutionary signal is thin, as it often is in disordered regions. As of 2026-06-13 the disorder-specific reading rests on a single preprint (Sharma & Gitter 2025, arXiv:2504.16886) and is not an established boundary condition.

The honest framing is a range, not a ratio: ESM-2 is strong in the DNA-binding domain (0.46–0.68) and has no usable signal in the disordered regions (0 to −0.26). We report the regional values directly and do not compute a DBD-versus-IDR multiplier.
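Domain stratification of this kind can be sketched as follows. The residue boundaries are the ones stated in this report; everything else (function names, the tie-aware Spearman helper, and the toy records) is an illustrative assumption, not the pipeline that produced the reported values.

```python
from collections import defaultdict

# Domain boundaries as stated in this report (1-based, inclusive).
DOMAINS = [
    ("TAD1", 1, 40), ("TAD2", 41, 61), ("proline-rich", 62, 97),
    ("DBD", 102, 292), ("tetramerization", 323, 356),
    ("C-terminal regulatory", 357, 393),
]

def domain_of(position):
    for name, start, end in DOMAINS:
        if start <= position <= end:
            return name
    return "unassigned"  # residues outside the listed ranges, e.g. 98-101

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def stratified_spearman(records):
    """records: iterable of (position, model_score, dms_score) tuples."""
    groups = defaultdict(list)
    for pos, model, dms in records:
        groups[domain_of(pos)].append((model, dms))
    return {d: spearman(*zip(*p)) for d, p in groups.items() if len(p) > 2}

# Made-up scores for illustration only (not values from either DMS).
toy_records = [
    (120, -8.0, -1.5), (175, -10.0, -2.0), (248, -12.0, -2.5), (280, -3.0, -0.5),
    (10, -1.0, 0.3), (20, -2.0, 0.1), (30, -3.0, 0.2),
]
by_domain = stratified_spearman(toy_records)
print({d: round(r, 2) for d, r in by_domain.items()})  # {'DBD': 1.0, 'TAD1': 0.5}
```

Reporting one rho per domain, rather than pooling, keeps a strong folded-core signal from being averaged against regions where the ordering carries no information.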

Hotspots and worked examples. Across the eleven canonical cancer hotspots, ESM-2 and the Giacomelli assay are concordant on the loss-of-function (fold-destruction) axis — every hotspot scores negative under ESM-2 and loss-of-function under the DMS, most deleterious at R248W (ESM-2 −12.47). This is loss-of-function detection, not gain-of-function prediction: several of the canonical hotspots (e.g., R175H, R273H) are gain-of-function neomorphs, and ESM-2 is blind to that oncogenic gain (see disclosed weaknesses). The hepatocellular-carcinoma hotspot R249S (ESM-2 −6.34, loss-of-function in all four assay readouts) leads the germline worked example. The Li-Fraumeni founder variant R337H is the honest counter-case: ESM-2 scores it mild (−2.56), and so do all three standard 37 °C Giacomelli assays — a shared model-and-assay blind spot for the temperature-conditional defect, not a model-specific under-call.
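The concordance rule used for the hotspots (ESM-2 score negative and assay call loss-of-function) can be written down in a few lines. The sketch below uses only the three variants and ESM-2 scores named in this report; the assay labels are taken as given categorical calls, and the function and dictionary names are hypothetical. R337H is included as the report's counter-case, where the assay call is wild-type-like.

```python
# ESM-2 scores and assay calls for the three variants named in this report.
# The "LOF" / "WT-like" labels are taken as given; no thresholds are derived here.
WORKED_EXAMPLES = {
    "R248W": {"esm2_llr": -12.47, "dms_call": "LOF"},
    "R249S": {"esm2_llr": -6.34, "dms_call": "LOF"},
    "R337H": {"esm2_llr": -2.56, "dms_call": "WT-like"},
}

def lof_concordant(esm2_llr, dms_call):
    """Concordant on the loss-of-function axis: model negative AND assay LOF."""
    return esm2_llr < 0 and dms_call == "LOF"

def summarize_examples(examples):
    return {v: lof_concordant(e["esm2_llr"], e["dms_call"]) for v, e in examples.items()}

example_result = summarize_examples(WORKED_EXAMPLES)
print(example_result)  # {'R248W': True, 'R249S': True, 'R337H': False}
```

The rule says nothing about gain-of-function: a variant can satisfy it and still carry oncogenic activity that neither input captures.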

Baseline. The published ESM-2 650M zero-shot baseline across the full ProteinGym v1.3 suite is Spearman rho 0.414 (217 assays; rank 45 of 97 models), read from the public ProteinGym leaderboard. Every TP53 regional value above is reported against that baseline.

How to read the Spearman correlation. Spearman rho measures agreement on the ordering of variant effects, not per-variant accuracy. As a rough guide, a DBD rho of 0.46 maps to a concordance figure of (1 + 0.46)/2 = 0.73: for a randomly chosen pair of variants in the domain, ESM-2 and the experiment order them the same way roughly 73% of the time, a little under three in four. That conversion is exact only for Kendall’s tau (concordance = (1 + tau)/2 when there are no ties); applied to Spearman rho it is an approximation, so the 73% figure is indicative rather than measured. For clinical use, this is a prioritization signal, not a classification: ESM-2 helps rank which DNA-binding-domain VUS to send for functional testing first; it does not classify them. The published article makes no ACMG/AMP variant classifications; every VUS-related claim is descriptive (rank prioritization).
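The pairwise-agreement reading can be checked directly on a toy ranking. The sketch below counts concordant pairs and compares that fraction with (1 + tau)/2 for Kendall's tau, where the identity holds exactly without ties, and with (1 + rho)/2 for Spearman rho, where it is only an approximation. The data and function names are made up for illustration.

```python
from itertools import combinations

def pair_counts(x, y):
    """Count concordant and discordant pairs (tied pairs count as neither)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return concordant, discordant

def kendall_tau_a(x, y):
    c, d = pair_counts(x, y)
    n_pairs = len(x) * (len(x) - 1) // 2
    return (c - d) / n_pairs

def spearman_no_ties(x, y):
    """Classic formula 1 - 6*sum(d^2)/(n*(n^2-1)); valid only without ties."""
    n = len(x)
    rank_x = {v: r for r, v in enumerate(sorted(x), 1)}
    rank_y = {v: r for r, v in enumerate(sorted(y), 1)}
    d2 = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy scores with two adjacent swaps (illustrative, not real data).
model = [1, 2, 3, 4, 5]
assay = [2, 1, 4, 3, 5]

c, d = pair_counts(model, assay)
tau = kendall_tau_a(model, assay)
rho = spearman_no_ties(model, assay)
print(c / (c + d))    # 0.8  fraction of pairs ordered the same way
print((1 + tau) / 2)  # 0.8  matches exactly
print((1 + rho) / 2)  # 0.9  the rho-based shortcut differs
```

On this example the rho-based shortcut overstates pairwise agreement, which is why the 73% figure is best treated as a rough guide.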


Summary

| Metric | Value |
| --- | --- |
| Claims | 15 total (CLM-2026-3401 … CLM-2026-3415) |
| Result | All checked; the page passed verification |
| Per-claim outcome | 15 of 15 verified (0 contradicted, 0 unverifiable); shown claim-by-claim in the table below |
| Numbers checked for arithmetic consistency | Every quantitative claim (Spearman rho, sample size N, and categorical-agreement value) checked for internal consistency: rho values within the valid −1 to 1 range, sample-size accounting, and the domain-stratified accounting against the global value |
| Categorical classifications made | 0 (none authored; see “What Was Checked”) |
| Source data | TP53 full-length validation dataset (versions recorded below at build time) |
| Verified | 2026-06-04 (verdict-of-record 6cb0024) |
| Verified by | Veritas |

What Was Checked

Source-against-claim review (all 15 claims). Every claim was checked end-to-end against the primary source it cites — whether the source actually reports what we claim, and whether the scope matches. The stricter framing-fidelity check — whether the prose framing is faithful to what the cited authors reported — ran additionally on the verified subset.

Cited primary sources (independently resolved and checked figure-by-figure): Giacomelli et al. 2018 Nature Genetics (the three-condition saturation DMS); Kotler et al. 2018 Molecular Cell (the DNA-binding-domain proliferation assay); and the ProteinGym v1.3 substitution DMS leaderboard (the published 0.414 full-suite baseline). The benchmark claims are ESM-2’s own measurements compared against these source DMS datasets.

Context sources (referenced in the article for background, outside the formal benchmark-claim scope): UniProt P04637 (TP53 sequence reference), and the germline / clinical-context citations the article cites for framing — Guha & Malkin 2017 (DNA-binding-domain clustering of germline variants), Garritano et al. 2010 (the R337H Brazilian founder allele), and Federici & Soddu 2020 (the variant-of-uncertain-significance clinical-context figure). These supply background, not benchmark numbers.

Arithmetic-consistency check. Every Spearman rho, sample size, and categorical-agreement value was checked for internal consistency: rho values within the valid −1 to 1 range across the global result (0.41), the DNA-binding domain (0.46 Giacomelli, 0.68 Kotler), the tetramerization domain (0.38 / −0.41), and the disordered regions (0 to −0.26); the sample-size accounting (global N = 7,464 against the domain subsets); and the hotspot loss-of-function concordance count (11 of 11).
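One way to picture such a check is sketched below, using only figures stated in this report. It assumes that each N counts single amino-acid substitutions, so a region of L residues can contribute at most 19 × L variants; that ceiling, and the names used, are illustrative assumptions rather than the verifier's actual procedure.

```python
MISSENSE_PER_RESIDUE = 19  # 20 standard amino acids minus the wild type

def max_single_missense(start, end):
    """Upper bound on single amino-acid substitutions in residues start..end."""
    return (end - start + 1) * MISSENSE_PER_RESIDUE

# Spans, sample sizes, and rho values as reported in this document.
REPORTED = {
    "full length (Giacomelli)": {"span": (1, 393), "n": 7464, "rhos": [0.41]},
    "DBD (Giacomelli)": {"span": (102, 292), "n": 3628, "rhos": [0.46]},
    "DBD (Kotler subset)": {"span": (102, 292), "n": 1048, "rhos": [0.68]},
    "tetramerization": {"span": (323, 356), "n": 646, "rhos": [0.38, -0.41]},
}

def is_consistent(entry):
    """N must fit under the ceiling and every rho must lie in [-1, 1]."""
    ceiling = max_single_missense(*entry["span"])
    return entry["n"] <= ceiling and all(-1.0 <= r <= 1.0 for r in entry["rhos"])

checks = {name: is_consistent(entry) for name, entry in REPORTED.items()}
print(max_single_missense(1, 393))      # 7467
print(max_single_missense(102, 292))    # 3629
print(max_single_missense(323, 356))    # 646
print(all(checks.values()))             # True
```

Under that assumption the reported sample sizes sit at or just under the ceiling for each span, which is what near-complete saturation coverage would look like.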

Single-source measurement grain. Each regional Spearman value is ESM-2’s measurement against a single primary DMS for that region (one DMS record or one leaderboard row), not a value corroborated across two independent datasets. All 15 claims are verified, meaning each round-trips to its cited substrate under the numerical check. This note discloses the single-primary-source grain so that “verified” is read precisely: correctly grounded in the cited source, not independently cross-validated.

Categorical-classification check — 0 claims, none authored. Every TP53 variant claim is framed as a descriptive measurement — rank prioritization, regional correlation, loss-of-function concordance — not as an ACMG/AMP variant classification. The article states it plainly: ESM-2 helps prioritize which VUS to test first; it does not classify them. The underlying record matches that framing throughout.


How This Page Works

This is a single-protein report. The full-length TP53 substrate carries one evidence trail — one protein, one ESM-2 scoring run, cross-referenced to the Giacomelli and Kotler DMS datasets — and the claim list sits at the level of the published article. Each claim records the source it is checked against, so the per-claim source is auditable; the claim-by-claim table below shows that mapping, with each claim’s verification state derived from the verdict overlay.
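The claim-level bookkeeping described above can be pictured with a small data structure. This is a hypothetical sketch: the field names, and the status labels other than "Verified", are assumptions for illustration. The IDs, strengths, and statuses mirror the Claims Register on this page.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    claim_id: str
    strength: str
    status: str

def build_register():
    """Rebuild the 15-entry register; only CLM-2026-3406 is rated moderate."""
    claims = []
    for n in range(3401, 3416):
        strength = "moderate" if n == 3406 else "strong"
        claims.append(Claim(f"CLM-2026-{n}", strength, "Verified"))
    return claims

def summarize_register(claims):
    """Produce the per-claim outcome line from the individual statuses."""
    tally = Counter(c.status for c in claims)
    return (f"{tally['Verified']} of {len(claims)} verified "
            f"({tally['Contradicted']} contradicted, "
            f"{tally['Unverifiable']} unverifiable)")

register = build_register()
print(summarize_register(register))
# 15 of 15 verified (0 contradicted, 0 unverifiable)
```

Deriving the summary line from the per-claim records, rather than storing it separately, keeps the headline count and the table from drifting apart.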

The exact source-data versions and verification-result version used at this page’s most recent build are shown in the footer below — an evidence chain back to a specific, reproducible state.


Verification independence

Verification is performed by Veritas, an independent agent with no involvement in producing the content it checks.

Claims Register

| Claim ID | Claim | Source | Strength | Status |
| --- | --- | --- | --- | --- |
| CLM-2026-3401 | ~7,400 TP53 missense variants, MITE saturation mutagenesis, three selection conditions | Giacomelli et al. 2018, Nature Genetics 50(10):1381-1387 [Data source] | strong | Verified |
| CLM-2026-3402 | Kotler 2018 ProteinGym v1.3 subset: 1,048 DBD single-missense variants (of the paper's ~10,000) | Kotler et al. 2018, Molecular Cell 71(1):178-190 [Data source] | strong | Verified |
| CLM-2026-3403 | Full-length TP53 vs Giacomelli Null-Nutlin: N=7,464, Spearman rho = 0.41 | Axon Agentic ESM-2 TP53 full-length cross-reference (Null_Nutlin) [Data source] | strong | Verified |
| CLM-2026-3404 | DBD (102-292): N=3,628, Spearman rho = 0.46 (STRONG; reproduces prior 0.463) | Axon Agentic ESM-2 TP53 DBD domain-stratified cross-reference [Data source] | strong | Verified |
| CLM-2026-3405 | DBD via Kotler: N=1,048, Spearman rho = 0.68 (strongest TP53 correlation) | Axon Agentic ESM-2 TP53 Kotler cross-reference [Data source] | strong | Verified |
| CLM-2026-3406 | Tetramerization (323-356): N=646, rho = 0.38 (Null assays), -0.41 (WT-Nutlin); WEAK/assay-dependent | Axon Agentic ESM-2 TP53 tetramerization domain-stratified cross-reference [Data source] | moderate | Verified |
| CLM-2026-3407 | IDRs (TAD1/TAD2/proline-rich/regulatory): rho approx 0 to -0.26; WEAK, no signal | Axon Agentic ESM-2 TP53 IDR domain-stratified cross-reference [Data source] | strong | Verified |
| CLM-2026-3408 | 11/11 cancer hotspots concordant on the LOF (fold-destruction) axis: ESM-2 negative AND Giacomelli LOF | Axon Agentic ESM-2 TP53 hotspot concordance (full-length substrate) [Data source] | strong | Verified |
| CLM-2026-3409 | R249S (HCC): ESM-2 -6.34; DMS LOF in all 4 assays (WT-Nutlin 4th pctile); concordant | Axon Agentic ESM-2 TP53 R249S: of-record substrate (ESM-2 LLR) + ProteinGym v1.3 DMS (per-variant LOF) [Data source] | strong | Verified |
| CLM-2026-3410 | R337H: ESM-2 mild (-2.56) AND all 3 Giacomelli assays WT-like (+0.21/+0.20/+0.96); SHARED blind-spot, not model under-call | Axon Agentic ESM-2 TP53 R337H: of-record substrate (ESM-2 LLR) + Giacomelli DMS (per-variant WT-like, all 3 assays) [Data source] | strong | Verified |
| CLM-2026-3411 | ESM-2 is GOF-blind: hotspot negative scores are LOF (fold destruction) detection, NOT a gain-of-function prediction; disclosed weakness (Boundary 7) | Axon Agentic ESM-2 boundary catalog (Boundary 7, gain-of-function) + TP53 hotspot LOF concordance [Qualifying note] | strong | Verified |
| CLM-2026-3412 | 11/11 TP53 hotspots LOF in Giacomelli Null-Nutlin (per-variant DMS) | Giacomelli et al. 2018 Null-Nutlin per-variant DMS (ProteinGym v1.3), TP53 hotspot subset [Data source] | strong | Verified |
| CLM-2026-3413 | 11/11 TP53 hotspots LOF in Giacomelli Null-Etoposide (per-variant DMS) | Giacomelli et al. 2018 Null-Etoposide per-variant DMS (ProteinGym v1.3), TP53 hotspot subset [Data source] | strong | Verified |
| CLM-2026-3414 | 11/11 TP53 hotspots LOF in Giacomelli WT-Nutlin (per-variant DMS) | Giacomelli et al. 2018 WT-Nutlin per-variant DMS (ProteinGym v1.3), TP53 hotspot subset [Data source] | strong | Verified |
| CLM-2026-3415 | ESM-2 650M ProteinGym full-suite baseline: Spearman rho = 0.414 (217 assays, rank 45/97) | ProteinGym v1.3 leaderboard (Notin et al. 2023), ESM-2 650M aggregate Spearman; public leaderboard [Data source] | strong | Verified |

Substantiated against the TP53 validation claim list (version 6cb0024) and the verification result on record (version 10aa958). Both versions were captured at build time, with a cross-checked source trail.