Verification Report

This report documents independent claim verification for the article “ESM-2 Ranks 43rd of 95 on the ProteinGym PG_v1.3 Leaderboard. It Also Requires a GPU to Run.”

Validation Methodology

Benchmark source: ProteinGym PG_v1.3 archive (OATML-Markslab/ProteinGym, tag PG_v1.3, sha256 89c0cdd4; original benchmark design from Notin et al. 2023, NeurIPS). Reference baseline: ESM-2 (650M) Spearman rho = 0.414, the model’s published value on this dataset per the byte-pinned PG_v1.3 leaderboard CSV (rank 43 of 95 models). NeuroAutomata 5-protein validation: median rho = 0.515. Per-protein and per-category breakdowns are reported as descriptive statistics against the pinned-tag cross-model category medians, rather than against a live, moving leaderboard pointer.
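
A byte-pin of this kind can be checked by hashing the downloaded archive and comparing the result with the recorded digest. The sketch below is a minimal illustration, assuming that the 89c0cdd4 value quoted above is the leading eight hex characters of the full SHA-256 digest. The function names are hypothetical and are not part of Veritas.

```python
import hashlib


def sha256_hex(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path, pinned_prefix):
    """True if the file's digest starts with the pinned hex prefix.

    Assumption: the pin is a prefix of the full digest, not a separate hash.
    """
    return sha256_hex(path).startswith(pinned_prefix.lower())
```

Calling matches_pin on a local copy of the archive with the prefix "89c0cdd4" would then return True only if the bytes match the pinned snapshot. A prefix match is weaker than a full-digest match, so a production check would ideally record and compare all 64 hex characters.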

Summary

Metric	Value
Claims	25 total
Verified	21
Unverifiable	4
Not assessed	0
Contradicted	0
Numerically verified (Z3)	17 of 25 — all passed, 0 contradictions
Citation-verified (L4 source review)	4 (CLM-2026-0069, 0072, 0092, 0093)
Sources fetched	21/25
Staleness flags	0
Unverifiable claims	4 (CLM-2026-0087, 0088, 0089, 0090)
Unique sources	8
Verification date	2026-07-17
Verified by	Veritas v1.1
Pipeline version	content-build process v1.1
Layers active	Layer 1: Z3 Numerical, Layer 4: Semantic
Layers not yet available	Layer 2: Ontological, Layer 3: ACMG/AMP

Layer 4 (semantic) was applied in this pass as an independent source-level review of the 5 vendored primary sources (ProteinGym PG_v1.3, Olson 2014 GB1, Adzhubei 2010 PolyPhen-2, Cheng 2024 VariPred, Findlay 2018 BRCA1), producing per-claim L4 verdicts on 6 of 25 claims. This is a graded verdict, not a page-level “all verified” pass: 21 of 25 live claims were verified (17 numerically via Z3 at Layer 1 and 4 via L4 citation review), and 4 could not be independently confirmed. Two of the six L4 verdicts (CLM-2026-0094 and 0095) also passed Layer 1 and are counted among the 17, which is why only 4 claims are listed as citation-verified. The 4 unverifiable claims (CLM-2026-0087 through 0090) were resolved on substrate grounds, meaning either a flagged-for-replacement internal benchmark or no accessible source, not by a semantic verdict; they are labeled accordingly below. CLM-2026-0085 and CLM-2026-0086 were cut from the article and have been removed from the register, so they are no longer live claims.

Claims Register

What was checked and whether it passed. Claim text is the original prose from the article.

CLM-2026-0069 — VERIFIED The performance gap between legacy tools and modern protein language models is documented in the ProteinGym substitution benchmark (Notin et al. 2023, NeurIPS) across 217 protein assays. Notin et al. 2023 · NeurIPS 2023 (byte-pinned snapshot) · L1: NOT_APPLICABLE
CLM-2026-0070 — VERIFIED ESM-2 650M ranks 43 of 95 models on the ProteinGym PG_v1.3 leaderboard. ProteinGym PG_v1.3 (OATML-Markslab/ProteinGym, tag PG_v1.3, sha256 89c0cdd4) · leaderboard CSV · L1: VERIFIED
CLM-2026-0071 — VERIFIED The published ESM-2 650M result on the full ProteinGym suite (217 assays) is rho = 0.414 aggregate (mean) — rank 43 of 95 models on the ProteinGym PG_v1.3 leaderboard. ProteinGym PG_v1.3 (OATML-Markslab/ProteinGym, tag PG_v1.3, sha256 89c0cdd4); benchmark design Notin et al. 2023 NeurIPS · leaderboard CSV · L1: VERIFIED
CLM-2026-0072 — VERIFIED GB1 (protein G B1 domain), a field-standard benchmark with experimental measurements for 1,045 single mutants (Olson et al. 2014). Olson, Wu & Sun 2014 · doi:10.1016/j.cub.2014.09.072 (byte-pinned snapshot) · L1: NOT_APPLICABLE
CLM-2026-0073 — VERIFIED Our internal validation result of rho = 0.276 (p < 10⁻¹⁹) confirmed the signal is real. Axon Agentic internal validation, engine confirmed against Wu et al. 2016 · L1: VERIFIED
CLM-2026-0074 — VERIFIED Beta-lactamase (BLAT) | BLAT_ECOLX_Stiffler_2015 | Enzymatic activity | 0.731. Axon Agentic internal validation vs. ProteinGym BLAT_ECOLX_Stiffler_2015 · L1: VERIFIED
CLM-2026-0075 — VERIFIED Cross-model activity-category median rho = 0.417 across 95 models on the ProteinGym PG_v1.3 leaderboard. ProteinGym PG_v1.3 (OATML-Markslab/ProteinGym, tag PG_v1.3, sha256 89c0cdd4), Function_Activity column cross-model median · leaderboard CSV · L1: VERIFIED
CLM-2026-0076 — VERIFIED BRCA1 | BRCA1_HUMAN_Findlay_2018 | Activity (SGE, 95.9% (162/169) ClinVar concordance) | 0.515. Axon Agentic internal validation vs. ProteinGym BRCA1_HUMAN_Findlay_2018 · L1: VERIFIED
CLM-2026-0077 — VERIFIED Cross-model activity-category median rho = 0.417 across 95 models on the ProteinGym PG_v1.3 leaderboard. (Duplicate of CLM-2026-0075 retained for blog-page-anchor historical reference; Veritas verifies once.) ProteinGym PG_v1.3 (OATML-Markslab/ProteinGym, tag PG_v1.3, sha256 89c0cdd4), Function_Activity column cross-model median · leaderboard CSV · L1: VERIFIED
CLM-2026-0078 — VERIFIED UBC9 | UBC9_HUMAN_Weile_2017 | Expression | 0.473. Axon Agentic internal validation vs. ProteinGym UBC9_HUMAN_Weile_2017 · L1: VERIFIED
CLM-2026-0079 — VERIFIED Cross-model expression-category median rho = 0.417 across 95 models on the ProteinGym PG_v1.3 leaderboard. ProteinGym PG_v1.3 (OATML-Markslab/ProteinGym, tag PG_v1.3, sha256 89c0cdd4), Function_Expression column cross-model median · leaderboard CSV · L1: VERIFIED
CLM-2026-0080 — VERIFIED PTEN | PTEN_HUMAN_Mighell_2018 | Organismal fitness | 0.519. Axon Agentic internal validation vs. ProteinGym PTEN_HUMAN_Mighell_2018 · L1: VERIFIED
CLM-2026-0081 — VERIFIED Cross-model organismal-fitness-category median rho = 0.384 across 95 models on the ProteinGym PG_v1.3 leaderboard. ProteinGym PG_v1.3 (OATML-Markslab/ProteinGym, tag PG_v1.3, sha256 89c0cdd4), Function_OrganismalFitness column cross-model median · leaderboard CSV · L1: VERIFIED
CLM-2026-0082 — VERIFIED Calmodulin (CALM1) | CALM1_HUMAN_Weile_2017 | Binding | 0.212 — well below ESM-2 650M’s own published aggregate of 0.414, consistent with the model’s known binding-affinity weakness. Axon Agentic internal validation vs. ProteinGym CALM1_HUMAN_Weile_2017 · L1: VERIFIED
CLM-2026-0083 — VERIFIED Cross-model binding-category median rho = 0.326 across 95 models on the ProteinGym PG_v1.3 leaderboard. ProteinGym PG_v1.3 (OATML-Markslab/ProteinGym, tag PG_v1.3, sha256 89c0cdd4), Function_Binding column cross-model median · leaderboard CSV · L1: VERIFIED
CLM-2026-0084 — VERIFIED Median rho = 0.515 across 5 proteins. Axon Agentic internal validation, 5-protein subset · L1: VERIFIED
CLM-2026-0087 — UNVERIFIABLE — couldn’t independently verify For CYP2C9, the first gene in the ESM-2 Benchmark Series, rho = 0.679 overall. Reason: internal CYP2C9 benchmark on a curation-flagged substrate pending replacement — round-trips only to a self-named cherry-pick benchmark file, not an independent primary source. Axon Agentic CYP2C9 benchmark post · L1: UNVERIFIABLE
CLM-2026-0088 — UNVERIFIABLE — couldn’t independently verify rho = 0.811 in the heme-binding domain where most known pharmacogenomic alleles sit. Reason: internal CYP2C9 benchmark on a curation-flagged substrate pending replacement — round-trips only to a self-named cherry-pick benchmark file, not an independent primary source. Axon Agentic CYP2C9 benchmark post · L1: UNVERIFIABLE
CLM-2026-0089 — UNVERIFIABLE — couldn’t independently verify A ~2× performance gap between the heme-binding domain (rho = 0.811) and the substrate recognition site (SRS5, rho = 0.422). Reason: internal CYP2C9 benchmark on a curation-flagged substrate pending replacement — round-trips only to a self-named cherry-pick benchmark file, not an independent primary source. Axon Agentic CYP2C9 benchmark post · L1: UNVERIFIABLE
CLM-2026-0090 — UNVERIFIABLE — couldn’t independently verify Activity-without-abundance variants (a subset of pharmacogenomic deficiency alleles, including characterized examples across TPMT, NUDT15, CYP2C9, and CYP2C19). Reason: unpublished internal analysis, no accessible source. Axon Agentic internal analysis (unpublished) · L1: UNVERIFIABLE
CLM-2026-0091 — VERIFIED Precision varies from 0.62 (CYP2C9 activity) to 0.16 (NUDT15) at the same cutoff, evaluated across six pharmacogenes (not “enzymes” — corrected subject) in the Benchmark Series threshold validation. axon_kb/research/threshold-validation.json · L1: VERIFIED
CLM-2026-0092 — VERIFIED PolyPhen-2 was published in 2010. Adzhubei et al. 2010 · doi:10.1038/nmeth0410-248 (byte-pinned snapshot) · L1: NOT_APPLICABLE
CLM-2026-0093 — VERIFIED Protein language models have existed since 2021 (ESM-1b; Rives et al. 2021) — a fact of existence, distinct from a documented performance margin over PolyPhen-2, which was established in 2024 (VariPred: ESM-1b MCC = 0.649 vs. PolyPhen-2 MCC = 0.521, ClinVar n = 12,853). The prior citation (Lin et al. 2023, doi:10.1126/science.ade2574) was incorrect for this claim and has been dropped. Existence: Rives et al. 2021 (ESM-1b) · doi:10.1073/pnas.2016239118. Margin: Cheng et al. 2024 (VariPred) · doi:10.1038/s41598-024-51489-7 (byte-pinned snapshot) · L1: NOT_APPLICABLE
CLM-2026-0094 — VERIFIED 95.9% (162 of 169 ClinVar-pathogenic SNVs) concordance with ClinVar pathogenic classifications (Findlay et al. 2018, BRCA1 SGE). Findlay et al. 2018 · doi:10.1038/s41586-018-0461-z (byte-pinned snapshot) · L1: VERIFIED
CLM-2026-0095 — VERIFIED 90.9% (20 of 22 ClinVar-benign SNVs) concordance with ClinVar benign classifications (Findlay et al. 2018, BRCA1 SGE). Findlay et al. 2018 · doi:10.1038/s41586-018-0461-z (byte-pinned snapshot) · L1: VERIFIED
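
Several register entries are simple derived quantities that can be recomputed from other values in the same register. The sketch below recomputes the 5-protein median (CLM-2026-0084) from the per-protein rho values in CLM-2026-0074, 0076, 0078, 0080, and 0082, and the two ClinVar concordance percentages (CLM-2026-0094, 0095) from their stated counts. It checks arithmetic only and does not re-derive any rho value from assay data; the helper name pct is hypothetical.

```python
from statistics import median

# Per-protein rho values exactly as listed in the register above.
per_protein_rho = {
    "BLAT_ECOLX_Stiffler_2015": 0.731,
    "BRCA1_HUMAN_Findlay_2018": 0.515,
    "UBC9_HUMAN_Weile_2017": 0.473,
    "PTEN_HUMAN_Mighell_2018": 0.519,
    "CALM1_HUMAN_Weile_2017": 0.212,
}


def pct(numerator, denominator):
    """Percentage rounded to one decimal place, as reported in the register."""
    return round(100 * numerator / denominator, 1)


median_rho = median(per_protein_rho.values())
pathogenic_concordance = pct(162, 169)
benign_concordance = pct(20, 22)

print(median_rho)              # 0.515
print(pathogenic_concordance)  # 95.9
print(benign_concordance)      # 90.9
```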

What Was Verified

This is a graded verdict — 21 verified, 4 unverifiable, 0 not_assessed, 0 contradicted, across 25 live claims. CLM-2026-0085 and CLM-2026-0086 have been removed from the article and do not appear in this register.

Verified (21): confirmed by recompute against the byte-pinned ProteinGym PG_v1.3 leaderboard (tag PG_v1.3, sha256 89c0cdd4), citation-check against byte-pinned primary sources, or direct pointer to internal reference data (axon_kb/research/threshold-validation.json).
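
As a rough illustration of what a recompute against the leaderboard CSV involves, the sketch below derives a cross-model column median and a model's rank from a small in-memory table. Every row of the toy data is invented for illustration and is not taken from ProteinGym. The column name Function_Activity comes from the register above; the Model_name and Spearman column names are assumptions about the CSV layout and would need to be checked against the pinned file.

```python
import csv
import io
from statistics import median

# Toy stand-in for the leaderboard CSV. All names and scores are invented.
TOY_CSV = """Model_name,Spearman,Function_Activity
model_a,0.45,0.48
model_b,0.41,0.42
model_c,0.38,0.40
model_d,0.30,0.33
model_e,0.25,0.29
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def column_median(rows, column):
    """Cross-model median of one numeric column."""
    return median(float(row[column]) for row in rows)


def rank_of(rows, model, name_col="Model_name", score_col="Spearman"):
    """Return (1-based rank by descending score, number of models).

    Ties are not handled specially in this sketch.
    """
    ordered = sorted(rows, key=lambda row: float(row[score_col]), reverse=True)
    names = [row[name_col] for row in ordered]
    return names.index(model) + 1, len(names)


rows = load_rows(TOY_CSV)
print(column_median(rows, "Function_Activity"))  # 0.4
print(rank_of(rows, "model_b"))                  # (2, 5)
```

Run against the pinned CSV, the same two functions would be expected to reproduce figures such as the category medians and the rank of 43 of 95, provided the column names match.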

Layer 1 (Z3 numerical): 17 quantitative claims were checked for arithmetic consistency: rho values within the valid range (−1 to 1), percentage deltas computed from stated baselines, ranking bounds valid (1 ≤ rank ≤ total), and derived values consistent with stated inputs. 0 contradictions found. (4 claims carry no quantitative decomposition and are marked NOT_APPLICABLE at L1; the 4 unverifiable claims received no L1 pass/fail verdict and appear as UNVERIFIABLE in the L1 column.)
Layer 4 (semantic): Independent source-level review of the 5 vendored primary sources produced per-claim L4 verdicts on 6 of 25 claims. Sources reviewed at L4: ProteinGym PG_v1.3 leaderboard CSV (OATML-Markslab/ProteinGym, tag PG_v1.3, sha256 89c0cdd4; benchmark design Notin et al. 2023, NeurIPS) → CLM-2026-0069; Olson et al. 2014 (GB1 single-mutant dataset) → 0072; Adzhubei et al. 2010 (PolyPhen-2) → 0092; Cheng et al. 2024 / VariPred (doi:10.1038/s41598-024-51489-7) → 0093; Findlay et al. 2018 (BRCA1 SGE, doi:10.1038/s41586-018-0461-z) → 0094/0095. Additional primary sources underlying recomputed or internal-reference claims (not L4-reviewed this pass): Mighell et al. 2018 (PTEN), Weile et al. 2017 (UBC9, CALM1), Stiffler et al. 2015 (BLAT).
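
The Layer 1 constraints described above can be illustrated without a solver. The sketch below is a plain-Python approximation of the range, ranking-bound, and tally checks. It is not Veritas's Z3 encoding, and the claim dictionary layout and function names are assumptions made for this example.

```python
def check_claim(claim):
    """Return a list of constraint violations for one claim dict."""
    problems = []
    rho = claim.get("rho")
    if rho is not None and not -1.0 <= rho <= 1.0:
        problems.append("rho outside [-1, 1]")
    rank, total = claim.get("rank"), claim.get("total")
    if rank is not None and total is not None and not 1 <= rank <= total:
        problems.append("rank outside [1, total]")
    return problems


def check_tally(total, **buckets):
    """Buckets must be non-negative and sum to the stated total."""
    problems = [f"{name} is negative" for name, n in buckets.items() if n < 0]
    if sum(buckets.values()) != total:
        problems.append("buckets do not sum to total")
    return problems


# Values taken from the register and summary in this report.
esm2 = {"id": "CLM-2026-0071", "rho": 0.414, "rank": 43, "total": 95}
print(check_claim(esm2))  # []
print(check_tally(25, verified=21, unverifiable=4,
                  not_assessed=0, contradicted=0))  # []
print(check_tally(25, l1_verified=17, l1_not_applicable=4,
                  l1_unverifiable=4))  # []
```

An empty list means no constraint was violated. A solver-based version would state the same conditions as assertions and ask whether they are jointly satisfiable, which scales better once claims depend on one another.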

Unverifiable (4), meaning Veritas couldn’t independently verify them: CLM-2026-0087, 0088, and 0089 (CYP2C9 overall, heme-domain, and SRS5 figures) round-trip only to a self-named internal benchmark file on a substrate flagged for curation replacement, not to an independent primary source. CLM-2026-0090 (activity-without-abundance framing) rests on an unpublished internal analysis with no accessible source. None of the four is asserted to be false: “unverifiable” means Veritas could not independently confirm them in this pass, not that they were checked and found wrong. The article carries an inline preliminary/pending-verification qualifier at first mention of the CYP2C9 figures; treat them as provisional pending Part 2’s independent substrate replacement.

NOT_APPLICABLE (L1 column): claim has no quantitative decomposition field — skipped by Z3, verified by Layer 4 semantic review only.

Layers 2 (ontological) and 3 (ACMG/AMP classification) are on the roadmap.

Pipeline

This article was produced using the Axon Agentic content build pipeline:

RESOLVE → DRAFT → LINT → TEST → REVIEW → PUBLISH

Verification is performed by Veritas, an independent agent with no involvement in content production.