rho = 0.772 in BRCA1 BRCT: What 5,949 Cancer Variants Reveal About ESM-2's Limits


The VUS Problem in Cancer Genetics

41% of individuals undergoing genetic testing across clinical specialties received at least one variant of uncertain significance (VUS), a variant that has not been classified as definitively pathogenic or benign (Chen et al., JAMA Network Open, 2023). Every hereditary cancer gene has hundreds, sometimes thousands, of VUS awaiting classification. A patient carrying a BRCA1 VUS faces a clinical limbo: the variant was detected, but its significance is unknown.

BRCA1 and PTEN are two of the most clinically important tumor suppressors. Germline BRCA1 mutations cause hereditary breast and ovarian cancer syndrome (lifetime breast cancer risk up to 72%). Germline PTEN mutations cause PTEN Hamartoma Tumor Syndrome, including Cowden syndrome (lifetime breast cancer risk 85%, thyroid 35%, endometrial 28%). Both proteins have been characterized by gold-standard functional assays — but the assays cover only a fraction of possible variants, and new VUS appear monthly as genetic testing scales.

The question: can a protein language model (PLM), a deep learning model trained on millions of protein sequences and used here to score how mutations affect function, provide reliable computational evidence for cancer VUS classification? NeuroAutomata uses ESM-2, a PLM developed by Meta AI.

We tested this by cross-referencing ESM-2 650M against two gold-standard datasets:

  1. BRCA1: 1,837 missense variants scored by saturation genome editing (SGE), the definitive functional assay for BRCA1 variant classification (Findlay et al. 2018, Nature)
  2. PTEN: 4,112 missense variants scored by VAMP-seq, a high-throughput assay of protein abundance used as a proxy for stability and function (Matreyek et al. 2018, Nature Genetics)

Together: 5,949 cancer-related variants across two proteins, two assays, two independent research groups.

NOTE

Why two assays matter here: SGE measures cell viability — does the variant kill the cell? VAMP-seq measures protein abundance — does the variant destabilize the protein? These are different questions. A variant can destabilize the protein (VAMP-seq catches it) OR leave the protein structurally intact but catalytically dead (VAMP-seq misses it, but ESM-2 may catch it). Testing against both assays reveals complementary blind spots.


Scoring 5,949 Variants Across Two Proteins and Two Assays

BRCA1: We scored the RING domain (residues 1-101, zinc-binding E3 ubiquitin ligase) and the BRCT tandem repeat (residues 1631-1863, phosphopeptide-binding domain) as independent fragments using ESM-2's masked marginal scoring, a zero-shot method that masks each position and measures how surprising the mutant amino acid is relative to wild-type. BRCA1's full length (1,863 amino acids) exceeds the platform's 1,024 amino acid input limit, but both domains fold independently (PDB 1JM7 for RING, PDB 1T29 for BRCT). We computed Spearman rank correlation, which measures whether two variables agree in order rather than magnitude, against the SGE function scores from the ProteinGym (Notin et al. 2023, NeurIPS, 217 assays) version of the Findlay dataset (1,837 missense variants).
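Masked marginal scoring, as described above, reduces to a difference of two log-probabilities at the masked position. This is a minimal sketch, not the platform's implementation: it assumes the per-position log-probabilities have already been produced by running ESM-2 with each position masked (that model call is omitted), and the `masked_marginal` helper and toy numbers are illustrative. The `fragment_start` offset is what lets BRCA1 protein numbering (e.g. M1775R) index into a fragment that begins at residue 1631.

```python
import math
import re

# One-letter missense notation: wild-type residue, 1-based protein position, mutant residue.
VARIANT_RE = re.compile(r"([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])")


def masked_marginal(variant, fragment_seq, fragment_start, log_probs):
    """Masked-marginal score: log p(mutant) - log p(wild-type) at a masked position.

    variant        -- e.g. "M1775R", numbered in full-length protein coordinates
    fragment_seq   -- the scored fragment, e.g. BRCT residues 1631-1863
    fragment_start -- 1-based protein position of fragment_seq[0]
    log_probs      -- one dict per fragment position mapping amino acid -> log-probability,
                      predicted with that position masked (assumed precomputed; the
                      ESM-2 forward pass is not shown here)
    Negative scores mean the mutant is less plausible than wild-type (predicted damaging).
    """
    m = VARIANT_RE.fullmatch(variant)
    if m is None:
        raise ValueError(f"not a single missense variant: {variant}")
    wt, pos, mut = m.group(1), int(m.group(2)), m.group(3)
    idx = pos - fragment_start  # 0-based index inside the fragment
    if not 0 <= idx < len(fragment_seq):
        raise ValueError(f"{variant} lies outside the scored fragment")
    if fragment_seq[idx] != wt:
        raise ValueError(f"wild-type mismatch at {pos}: fragment has {fragment_seq[idx]}")
    return log_probs[idx][mut] - log_probs[idx][wt]


# Toy example with made-up probabilities: a 3-residue "fragment" starting at position 1774.
frag = "AMK"
toy_log_probs = [{}, {"M": math.log(0.60), "R": math.log(0.01)}, {}]
score = masked_marginal("M1775R", frag, 1774, toy_log_probs)
print(round(score, 3))  # -4.094
```

The wild-type check is deliberate: an off-by-one in fragment coordinates is the easiest way to silently score the wrong residue.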

PTEN: We scored the full 403 amino acid sequence (UniProt P60484) and computed Spearman correlation against VAMP-seq abundance scores from MaveDB (4,112 missense variants, 53.8% of all possible single amino acid substitutions).

NOTE

Saturation genome editing (SGE) uses CRISPR/Cas9 to introduce nearly every possible single-nucleotide variant at a genomic locus in human cells. Variants that impair protein function are depleted over time (cell viability selection). It is the gold standard for BRCA1 variant classification — achieving 95.9% concordance with ClinVar pathogenic and 90.9% with ClinVar benign classifications (Findlay et al. 2018). VAMP-seq measures intracellular protein abundance for thousands of variants in parallel using fluorescent fusions and cell sorting (Matreyek et al. 2018).

Runtime: ~30 seconds per protein on our platform. Both analyses combined: under 2 minutes.


Results

BRCA1: ESM-2 vs Saturation Genome Editing

| Domain | N (variants) | Spearman rho | p-value | Categorical agreement |
| --- | --- | --- | --- | --- |
| BRCT (phosphopeptide-binding) | 1,262 | 0.534 | 9.8 × 10⁻⁹⁴ | 67.7% (855/1,262) |
| RING (zinc-binding E3 ligase) | 575 | 0.409 | 1.4 × 10⁻²⁴ | 75.5% (434/575) |
| Combined | 1,837 | | | |

NOTE

What rho = 0.534 means in practice: Spearman rank correlation measures agreement on ordering, not per-variant accuracy. The probability that ESM-2 and SGE order a random pair of variants the same way is (1 + τ)/2, where τ is Kendall's tau, not Spearman's rho. Under a bivariate-normal approximation, rho = 0.534 corresponds to τ ≈ 0.37, so ESM-2 agrees with SGE on roughly 69% of variant pairs (the exact figure requires computing Kendall's tau on the matched scores). For clinical use: ESM-2 helps prioritize which VUS to investigate first; it does not independently classify them.
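The pairwise interpretation belongs to Kendall's tau rather than Spearman's rho: for untied pairs, P(concordant) = (1 + τ)/2. There is no exact rho-to-tau conversion for real data, but under a bivariate-normal assumption the standard relations r = 2 sin(πρ/6) and τ = (2/π) arcsin(r) link them. The sketch below is an illustration, not part of the original pipeline: it computes the exact pairwise rate from paired scores and the normal-theory approximation from a reported rho.

```python
import math
from itertools import combinations


def pairwise_concordance(x, y):
    """Share of pairs ordered the same way by x and y, ignoring pairs tied in either.

    With no ties this equals (1 + Kendall's tau) / 2.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return concordant / (concordant + discordant)


def concordance_from_spearman(rho_s):
    """Bivariate-normal approximation: Spearman rho -> Pearson r -> Kendall tau -> P(concordant)."""
    r = 2.0 * math.sin(math.pi * rho_s / 6.0)
    tau = (2.0 / math.pi) * math.asin(r)
    return (1.0 + tau) / 2.0


print(round(concordance_from_spearman(0.534), 3))  # 0.686, not (1 + 0.534) / 2 = 0.767
print(pairwise_concordance([1, 2, 3, 4], [1, 3, 2, 4]))  # 5 of 6 pairs agree
```

For the BRCT data itself, running `pairwise_concordance` (or `scipy.stats.kendalltau`) on the matched ESM-2 and SGE scores gives the exact rate; the quadratic loop is acceptable for a few thousand variants.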

The BRCT domain — which harbors the majority of clinically actionable BRCA1 VUS — shows the stronger global correlation (0.534). The RING domain is weaker (0.409), likely because the 101 amino acid fragment provides less evolutionary context for ESM-2’s attention mechanism.

The Strongest Domain-Level Result: BRCT Non-Active-Site rho = 0.772

When we stratify the BRCT domain by position type, the non-active-site annotated positions achieve rho = 0.772 (N = 59, p = 8.6 × 10⁻¹³) — the second strongest domain-level correlation in our entire 25-protein benchmark portfolio, behind only CYP2C9’s heme-binding domain (rho = 0.811).

| BRCT Region | N (variants) | Spearman rho | Categorical agreement |
| --- | --- | --- | --- |
| Non-active-site annotated | 59 | 0.772 | |
| Active-site (phosphopeptide-binding) | 9 | 0.700 | 88.9% |
| Unannotated positions | 1,194 | 0.527 | 67.2% |

Why this matters: the BRCT tandem repeat is a highly conserved structural domain that mediates BRCA1's interaction with phosphorylated proteins in the DNA damage response. Mutations that disrupt BRCT folding are strongly disfavored across evolution, and ESM-2 captures this constraint with high fidelity at annotated structural positions (rho = 0.772, N = 59 variants). Across the much larger set of variants at unannotated BRCT positions outside the phosphopeptide-binding interface (N = 1,194), ranking is moderate (rho = 0.527): useful for prioritization, but weaker than the headline number.
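The regional estimates in the table above rest on very different sample sizes, from N = 9 to N = 1,194 variants. One rough way to see what that means is an approximate confidence interval for Spearman's rho via the Fisher z transform with the Fieller-type standard error sqrt(1.06 / (N − 3)). These intervals were not part of the analysis above; the sketch is illustrative.

```python
import math


def spearman_ci(rho, n, z_crit=1.96):
    """Approximate 95% CI for Spearman's rho: Fisher z with SE = sqrt(1.06 / (n - 3))."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(rho)
    se = math.sqrt(1.06 / (n - 3))
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Regional BRCT estimates from the table above.
regions = [("non-active-site annotated", 0.772, 59),
           ("active-site", 0.700, 9),
           ("unannotated", 0.527, 1194)]
for label, rho, n in regions:
    lo, hi = spearman_ci(rho, n)
    print(f"{label:>26}: rho = {rho:.3f}, N = {n:>4}, approx. 95% CI {lo:.2f} to {hi:.2f}")
```

Under this approximation the N = 59 interval (about 0.64 to 0.86) sits above the unannotated interval (about 0.48 to 0.57), while the N = 9 active-site interval spans roughly 0.04 to 0.93 and says little on its own.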

Figure 1: ESM-2 mutation landscape for BRCA1 BRCT domain (233 positions × 20 amino acids). Red = predicted harmful, blue = tolerated. The dense red columns correspond to positions under strong evolutionary constraint — including non-active-site annotated residues that achieve rho = 0.772 against SGE function scores.
Figure 2: Per-position ESM-2 sensitivity across BRCA1 BRCT domain. Each bar represents the mean predicted effect of all substitutions at that position. Deeper bars = stronger evolutionary constraint. Non-active-site annotated positions (N = 59 variants) show the strongest constraint signal, consistent with the rho = 0.772 regional correlation.

RING Zinc-Binding: 100% Agreement Where It Matters Most

The BRCA1 RING domain coordinates two zinc ions through eight conserved residues (C24, C27, C39, H41, C44, C47, C61, C64). At these zinc-binding positions, ESM-2 and SGE achieve 100% categorical agreement — every single variant (49/49) is correctly identified as deleterious by both methods.

C61G, a founder pathogenic variant in Eastern European populations (ClinVar VCV000017661), is correctly scored as non-functional by both SGE (function score = -1.738) and ESM-2.

NOTE

An important nuance: The Spearman rho within the zinc-binding set is 0.020 (non-significant). This does not mean ESM-2 performs poorly. All 49 variants are deleterious, so both scores sit at the damaging end of their scales, and rank order within such a uniformly harmful set carries little information. The 100% categorical agreement is the clinically relevant result: any substitution at a zinc-coordinating position is correctly flagged.
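Categorical agreement is the share of variants that both methods put in the same class after each score is binarized. A minimal sketch: the SGE cutoff of -1.25 is the non-functional threshold cited in limitation 5 below, while the ESM-2 cutoff is a hypothetical placeholder because this post does not state the one used; all scores here are made up.

```python
def categorical_agreement(esm_scores, sge_scores, esm_cutoff, sge_cutoff=-1.25):
    """Fraction of variants where ESM-2 and SGE assign the same binary class.

    A variant is called deleterious when its score falls below the cutoff
    (lower = more damaging for both masked-marginal and SGE function scores).
    """
    if len(esm_scores) != len(sge_scores):
        raise ValueError("score lists must be paired")
    same = sum((e < esm_cutoff) == (s < sge_cutoff)
               for e, s in zip(esm_scores, sge_scores))
    return same / len(esm_scores)


# HYPOTHETICAL_ESM_CUTOFF is an illustrative placeholder, not the value used in this post.
HYPOTHETICAL_ESM_CUTOFF = -7.0
esm = [-9.1, -8.4, -2.0, -11.3]  # made-up masked-marginal scores
sge = [-1.9, -2.2, 0.1, -3.0]    # made-up SGE function scores
print(categorical_agreement(esm, sge, HYPOTHETICAL_ESM_CUTOFF))  # 1.0
```

Because a uniformly deleterious set lands on the same side of both cutoffs, agreement can reach 100% even when the rank correlation inside the set is near zero, which is the situation at the zinc-binding positions.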

Figure 3: ESM-2 mutation landscape for BRCA1 RING domain (101 positions × 20 amino acids). The eight zinc-coordinating positions appear as columns of uniformly deep red — every substitution is predicted deleterious. ESM-2 and SGE achieve 100% categorical agreement (49/49) at these positions.
Figure 4: Per-position ESM-2 sensitivity across BRCA1 RING domain. The eight zinc-coordinating residues produce the deepest sensitivity spikes, reflecting universal evolutionary constraint on zinc coordination by the cysteine thiolates and the H41 imidazole. Outside these positions, the 101 amino acid fragment provides less context for ESM-2, consistent with the modest global rho = 0.409.

PTEN: ESM-2 vs VAMP-seq Abundance

| Region | N (variants) | Spearman rho | p-value |
| --- | --- | --- | --- |
| Global | 4,112 | 0.484 | 4.6 × 10⁻²⁴⁰ |
| Non-active-site annotated | 205 | 0.536 | 1.3 × 10⁻¹⁶ |
| Active-site (CX5R motif) | 84 | -0.011 | 0.92 (not significant) |

ESM-2 achieves rho = 0.484 globally and 0.536 across the 205 variants at annotated positions outside the catalytic center. The active-site CX5R phosphatase motif (residues C124-R130) shows essentially zero correlation, and this is where the story gets interesting.
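Stratifying by region amounts to labeling each variant by position and then correlating each group separately. A minimal sketch using the two PTEN ranges named in this post (CX5R motif, residues 124-130; disordered C-terminal tail, residues 351-403). The post's "non-active-site annotated" set depends on functional annotations not listed here, so this coarse split will not reproduce the N = 205 subset; the helper names are illustrative.

```python
import re
from collections import defaultdict

# PTEN ranges named in this post (1-based, inclusive).
CX5R_MOTIF = range(124, 131)       # C124 through R130
C_TERMINAL_TAIL = range(351, 404)  # residues 351-403, intrinsically disordered


def pten_region(position):
    """Assign a PTEN residue position to a coarse region label."""
    if position in CX5R_MOTIF:
        return "cx5r_active_site"
    if position in C_TERMINAL_TAIL:
        return "c_terminal_tail"
    return "other"


def stratify(variants):
    """Group single missense variants such as 'C124S' by region."""
    groups = defaultdict(list)
    for v in variants:
        pos = int(re.fullmatch(r"[A-Z](\d+)[A-Z]", v).group(1))
        groups[pten_region(pos)].append(v)
    return dict(groups)


print(stratify(["C124S", "R130G", "G129E", "P354Q", "A79T", "S170R"]))
```

Each group can then be passed to any Spearman implementation; keeping the tail as its own group also makes it easy to exclude, as limitation 4 below recommends.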

Cross-reference note: ProteinGym scores PTEN against an expanded 2021 dataset (Matreyek et al., Genome Medicine), a different variant set from the 2018 VAMP-seq deposit we scored here, so the two correlations are not directly comparable. See the verification report for the baseline value and full provenance.

Figure 5: ESM-2 mutation landscape for PTEN (403 positions × 20 amino acids). Red = predicted harmful, blue = tolerated. The catalytic CX5R motif (C124-R130) shows deep red — ESM-2 predicts these positions as highly constrained — but VAMP-seq abundance disagrees for catalytic-dead-but-stable variants (rho = -0.011 at the active site). The lighter band near the C-terminal tail (residues 351-403) reflects the disordered regulatory region.
Figure 6: Per-position ESM-2 sensitivity across PTEN. The catalytic C124 and R130 positions (CX5R motif) register deep constraint bars — ESM-2 correctly identifies them as invariant — but this signal does not correlate with VAMP-seq abundance because catalytic-dead variants remain structurally stable. The shallower C-terminal region (residues 351-403) reflects the intrinsically disordered regulatory tail.

The Catalytic-Dead-but-Stable Problem

PTEN’s most important cancer mutations tell a story that neither ESM-2 nor VAMP-seq can tell alone.

C124S converts PTEN's catalytic cysteine to serine. The protein folds normally: VAMP-seq abundance = 1.14, indistinguishable from wild-type. But the enzyme is completely dead, with zero phosphatase activity. ESM-2, reading evolutionary conservation, correctly flags C124 as deeply conserved and scores substitutions there as deleterious.

R130G is one of the most common somatic PTEN mutations in cancer. VAMP-seq abundance = 1.09 — again, the protein is structurally fine. But R130 is part of the catalytic CX5R motif, and the mutation abolishes function. ESM-2 correctly identifies R130 as conserved.

| Variant | VAMP-seq Abundance | ESM-2 Direction | Clinical Status |
| --- | --- | --- | --- |
| C124S | 1.14 (WT-like) | Deleterious (conserved position) | Catalytically dead, dominant-negative |
| R130G | 1.09 (WT-like) | Deleterious (conserved position) | Cancer hotspot, loss-of-function |
| R130Q | WT-like | Deleterious (conserved position) | Cancer hotspot, loss-of-function |
| G129E | 0.76 | Deleterious | Cancer hotspot, loss-of-function |

These are the variants that matter most to clinical genetics, and the catalytic-site ones create a systematic disagreement: VAMP-seq reports WT-like abundance (which reads as tolerated), while ESM-2 scores them as deleterious. The disagreement is the signal.

A Decision Framework for VUS Interpretation

For PTEN VUS, combining ESM-2 and VAMP-seq creates a 2×2 interpretation matrix:

| | VAMP-seq: Low abundance | VAMP-seq: WT-like abundance |
| --- | --- | --- |
| ESM-2: High evolutionary constraint | Structurally destabilizing (pathogenic via degradation) | Catalytic-dead-but-stable (pathogenic via loss of function) |
| ESM-2: Low/neutral constraint | Stability defect ESM-2 misses (disordered regions) | Likely benign |

Neither tool alone covers both pathogenic mechanisms. Together, they span the landscape: VAMP-seq catches destabilization; ESM-2 catches catalytic inactivation at deeply conserved positions. This is directly relevant to the ACMG PP3/BP4 computational evidence criteria, which allow computational variant effect predictions to count as pathogenic (PP3) or benign (BP4) evidence in clinical variant classification (Richards et al. 2015).
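The 2×2 matrix can be expressed as a lookup on two booleans. Both cutoffs below are hypothetical placeholders: this post does not define numeric thresholds for "high constraint" or "low abundance", and limitation 9 below explains why any such cutoff needs per-protein calibration. The abundances in the usage lines are the C124S and A79T values from this post; the ESM-2 scores are made up.

```python
from typing import NamedTuple

# HYPOTHETICAL cutoffs for illustration only; calibrate per protein before use.
ESM_CONSTRAINT_CUTOFF = -7.0  # masked-marginal score below this = high constraint
LOW_ABUNDANCE_CUTOFF = 0.7    # VAMP-seq abundance below this = low (wild-type is ~1.0)


class Call(NamedTuple):
    quadrant: str
    mechanism: str


MATRIX = {
    (True, True): Call("high constraint / low abundance",
                       "structurally destabilizing (pathogenic via degradation)"),
    (True, False): Call("high constraint / WT-like abundance",
                        "catalytic-dead-but-stable (pathogenic via loss of function)"),
    (False, True): Call("low constraint / low abundance",
                        "stability defect ESM-2 misses (e.g. disordered regions)"),
    (False, False): Call("low constraint / WT-like abundance", "likely benign"),
}


def interpret(esm_score, abundance):
    """Place one variant in the ESM-2 x VAMP-seq matrix (supporting evidence, not a classification)."""
    constrained = esm_score < ESM_CONSTRAINT_CUTOFF
    low_abundance = abundance < LOW_ABUNDANCE_CUTOFF
    return MATRIX[(constrained, low_abundance)]


print(interpret(-10.0, 1.14).mechanism)  # C124S-like: constrained but stable
print(interpret(-2.0, 1.00).mechanism)   # A79T-like: neutral and stable
```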


Cross-Protein Perspective: What 5,949 Variants Tell Us

Testing across both proteins reveals a consistent pattern with one critical nuance:

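| Protein | Assay | N | Global rho | Non-active-site rho | Active-site rho |
| --- | --- | --- | --- | --- | --- |
| BRCA1 BRCT | SGE | 1,262 | 0.534 | 0.772 (N=59) | 0.700 (N=9) |
| BRCA1 RING | SGE | 575 | 0.409 | 0.563 (N=31) | 100% categorical (N=49) |
| PTEN | VAMP-seq | 4,112 | 0.484 | 0.536 (N=205) | -0.011 (N=84) |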

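The pattern: ESM-2 performs reliably in structured domains outside the catalytic center (rho = 0.536-0.772). At active sites, the result depends on what the assay measures. Where every substitution breaks a universally conserved structural requirement and the assay reads out function (BRCA1 zinc-binding vs. SGE), agreement is perfect at the categorical level. Where substitutions abolish catalysis without destabilizing the protein (PTEN CX5R vs. VAMP-seq abundance), the correlation is zero even though ESM-2 flags those positions as highly constrained.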

This is consistent with what we found in the CYP2C9 analysis: conserved catalytic machinery (CYP2C9 heme-binding, rho = 0.811) shows strong ESM-2 signal, while evolvable specificity regions show weak signal. The distinction is not active-site vs. non-active-site — it is invariant constraint vs. functional exploration.

Calibration anchor: ESM-2 captures roughly 23-35% of the variance in single-mutant functional effects for these cancer proteins. The two figures come from different correlation measures on different assays and are not directly comparable. For BRCA1 BRCT, Pearson r = 0.594 against SGE function scores gives r² ≈ 0.35. For PTEN, Spearman rho = 0.484 against VAMP-seq abundance gives rho² ≈ 0.23, which approximates variance explained in the ranks rather than in the raw scores. For CYP2C9, squaring Spearman rho the same way gives roughly 46%. The difference likely reflects protein size (CYP2C9: 490 aa full-length vs. BRCA1 BRCT: 233 aa fragment) and the depth of the evolutionary record for each protein family.


Clinical Variant Validation

BRCA1

| Variant | SGE Score | SGE Class | ESM-2 Concordance |
| --- | --- | --- | --- |
| C61G | -1.738 | Non-functional | ✅ Correctly flagged |
| A1708E | -2.008 | Non-functional | ✅ Correctly flagged |
| P1749R | -3.306 | Non-functional | ✅ Correctly flagged |
| M1775R | -1.393 | Non-functional | ✅ Correctly flagged |

All four well-characterized pathogenic variants are correctly identified. The gold standard context: SGE achieves 95.9% concordance with ClinVar pathogenic and 90.9% with ClinVar benign classifications (Findlay 2018). ESM-2’s rho = 0.534 against SGE means ESM-2 ranking correlates with the same gold-standard assay — not that ESM-2 independently achieves SGE-level ClinVar concordance.

The Findlay 2018 study reclassified 256 VUS: 25% scored non-functional and approximately 66% scored functional (computed from the three-category breakdown in Figure 3c). ESM-2's correlation with the same SGE function scores means it can provide provisional computational evidence for the BRCA1 VUS that continue to accumulate in ClinVar.

PTEN

| Variant | VAMP-seq Abundance | Clinical Status | ESM-2 Concordance |
| --- | --- | --- | --- |
| C124S | 1.14 | Catalytic-dead, dominant-negative | ✅ Flagged (conserved position) |
| R130G | 1.09 | Cancer hotspot | ✅ Flagged (conserved position) |
| A79T | 1.00 | Benign (gnomAD common) | ✅ Not flagged |
| P354Q | 1.20 | Benign (gnomAD common) | ✅ Not flagged |
| S294R | 0.92 | Benign (gnomAD common) | ✅ Not flagged |
| S170R | 0.29 | Domain interface destabilization | ✅ Correctly flagged |

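ESM-2 correctly handles both pathogenic mechanisms (catalytic-dead-but-stable AND structural destabilization) and does not flag the common benign variants. All three benign variants (A79T, P354Q, S294R) retain WT-like VAMP-seq abundance, consistent with ESM-2's neutral predictions. P354Q, however, lies in the disordered C-terminal tail (residues 351-403), where ESM-2 scores are least reliable (limitation 4 below), so that agreement carries less weight.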


Where This Fails

1. BRCA1 requires domain fragmentation. BRCA1 (1,863 aa) exceeds the 1,024 aa platform limit. We scored RING and BRCT as independent fragments. This is reasonable for independently folding domains, but inter-domain effects are not captured and full-protein evolutionary context is lost.

2. RING domain correlation is modest (rho = 0.409). The RING domain is only 101 amino acids — at the small-domain boundary where ESM-2’s attention mechanism has less sequence context. The 100% zinc-binding agreement is real, but outside the zinc sites, RING performance is weaker than BRCT.

3. PTEN active-site correlation is zero (rho = -0.011). The CX5R catalytic motif shows no rank correlation between ESM-2 scores and VAMP-seq abundance. This is not surprising — catalytic-dead-but-stable variants cluster here, creating systematic ESM-2/VAMP-seq disagreement. This is the biological basis for complementarity, not a model failure.

4. PTEN C-terminal tail (residues 351-403) is intrinsically disordered. ESM-2 performance degrades in disordered regions because evolutionary conservation patterns differ from structured domains. Do not use ESM-2 scores for PTEN variants in the C-terminal regulatory tail.

5. M1775R is barely non-functional. SGE score = -1.393, just below the -1.25 non-functional threshold. ESM-2 correctly identifies it as deleterious, but this variant sits at the edge of classification — a reminder that biological effects exist on a continuum.

6. ESM-2 captures ~35% of BRCT variance (r² ≈ 0.35, from Pearson r = 0.594) and roughly 23% of PTEN variance (rho² ≈ 0.23, from Spearman rho = 0.484, an approximation). The majority of functional variation is not predicted. This is a complement to experimental data, not a replacement.

7. Gain-of-function and dominant-negative effects are not reliably detected. PTEN dominant-negative variants like P38S (abundance = 1.14, drives increased Akt phosphorylation) may appear benign to both ESM-2 and VAMP-seq. ESM-2 is strongest for loss-of-function; suspected gain-of-function or dominant-negative effects require orthogonal evidence.

8. Protein-protein binding remains a failure mode. Calmodulin, a calcium-binding signaling protein whose function depends on binding partner interactions, achieves rho = 0.212 in our 25-protein portfolio — well below the published ESM-2 ProteinGym baseline of 0.414. If your protein of interest is evaluated primarily on binding affinity, treat ESM-2 predictions with significant caution.

9. Fixed score thresholds do not generalize across proteins. A cutoff that separates deleterious from tolerated variants in BRCA1 BRCT will not transfer to PTEN or any uncharacterized protein without re-calibration. Across ProteinGym assays, threshold-based precision ranges from 0.16 to 0.62 depending on protein and assay type. Use ESM-2 as a ranking tool — compare variants within a protein against each other — and establish thresholds from orthogonal data (functional assays, ClinVar curated variants) before applying to new VUS.
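One way to establish a threshold from orthogonal data is to pick the cutoff that best separates variants with trusted labels (for example, SGE non-functional vs. functional, or ClinVar pathogenic vs. benign) and report its precision. The sketch below maximizes balanced accuracy over candidate cutoffs; the objective, the `calibrate_cutoff` helper, and the toy data are assumptions for illustration, not the procedure used in this post.

```python
def calibrate_cutoff(scores, is_damaging):
    """Pick the cutoff (score <= cutoff is called damaging) that maximizes balanced accuracy.

    scores      -- ESM-2 scores for variants with trusted labels (lower = more damaging)
    is_damaging -- parallel booleans from orthogonal evidence (assay class or curated label)
    Returns (cutoff, balanced_accuracy, precision). Ties keep the most stringent cutoff.
    """
    n_pos = sum(is_damaging)
    n_neg = len(is_damaging) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both damaging and tolerated examples")
    best = None
    for cutoff in sorted(set(scores)):  # each observed score is a candidate cutoff
        tp = sum(1 for s, y in zip(scores, is_damaging) if s <= cutoff and y)
        fp = sum(1 for s, y in zip(scores, is_damaging) if s <= cutoff and not y)
        tn = n_neg - fp
        balanced_accuracy = 0.5 * (tp / n_pos + tn / n_neg)
        precision = tp / (tp + fp) if tp + fp else 0.0
        if best is None or balanced_accuracy > best[1]:
            best = (cutoff, balanced_accuracy, precision)
    return best


# Toy calibration set (made-up scores and labels).
scores = [-12.0, -10.5, -9.0, -6.0, -3.0, -1.0]
labels = [True, True, False, True, False, False]
print(calibrate_cutoff(scores, labels))
```

Report the precision alongside the cutoff, and apply the cutoff only to the protein and assay it was calibrated on.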


How We Verified This

This analysis was conducted by an AI research agent and independently audited against primary sources by a separate validation agent. Here is the audit trail:

BRCA1

| Category | Result |
| --- | --- |
| Variant SGE scores checked | 28 representative variants |
| Scores verified against ProteinGym source | 28/28 (100%) |
| Corrections caught pre-publication | 1 (R1699W not in dataset; corrected to R1699L/G/P) |
| ClinVar concordance statistics verified | 2/2 (95.9%, 90.9%) |
| Audit verdict | PASS |

PTEN

| Category | Result |
| --- | --- |
| VAMP-seq abundance scores checked | 30 variants |
| Scores verified against MaveDB | 30/30 (100%) |
| Contextual corrections applied | 2 (abundance class annotations refined) |
| Clinical variant annotations verified | All traced to paper text |
| Audit verdict | PASS |

What the audit caught: The original BRCA1 analysis referenced R1699W as being in the SGE dataset — it is not. The dataset covers R1699L, R1699G, and R1699P but not R1699W. Corrected before this post was written.

NOTE

Why we publish the audit: Every quantitative claim in this post traces to a primary source — either the published paper, MaveDB, ProteinGym, or our cross-reference computation. We believe computational biology content should be verifiable, not just peer-reviewed. If you find a discrepancy, contact us — we’ll correct and credit.


Reproduce This

You can independently verify every result in this post.

For computational colleagues: the raw data and scoring pipeline are fully reproducible — everything below is the exact methodology used.

Step 1: Get the experimental data

```shell
# BRCA1 SGE data from ProteinGym
# Dataset: BRCA1_HUMAN_Findlay_2018
# Available at https://proteingym.org/

# PTEN VAMP-seq data from MaveDB
curl -o pten-abundance.csv \
  "https://api.mavedb.org/api/v1/score-sets/urn:mavedb:00000013-a-1/scores"
```

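MaveDB score files describe protein changes in HGVS notation (for example `p.Cys124Ser`), while the variants discussed here use one-letter form (C124S). A minimal sketch of the conversion, assuming the downloaded CSV has `hgvs_pro` and `score` columns (check your file's header; column names can differ between score sets and MaveDB versions). The sample rows are made up apart from the C124S and R130G abundances quoted in this post.

```python
import csv
import io
import math
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
# Single substitution such as p.Cys124Ser (three-letter codes, no parentheses).
HGVS_MISSENSE = re.compile(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})")


def hgvs_to_short(hgvs_pro):
    """'p.Cys124Ser' -> 'C124S'; None for synonymous, nonsense, multi-mutant, or unparsable entries."""
    m = HGVS_MISSENSE.fullmatch(hgvs_pro or "")
    if m is None:
        return None
    wt, pos, mut = m.groups()
    if wt not in THREE_TO_ONE or mut not in THREE_TO_ONE or wt == mut:
        return None  # e.g. 'Ter' (stop) is not in the table
    return f"{THREE_TO_ONE[wt]}{pos}{THREE_TO_ONE[mut]}"


def load_missense_scores(csv_text):
    """Map one-letter missense variants to scores from a MaveDB-style CSV (assumed columns)."""
    scores = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = hgvs_to_short(row.get("hgvs_pro"))
        try:
            value = float(row.get("score"))
        except (TypeError, ValueError):
            continue  # missing or non-numeric score
        if key is not None and not math.isnan(value):
            scores[key] = value
    return scores


sample = """accession,hgvs_pro,score
row1,p.Cys124Ser,1.14
row2,p.Arg130Gly,1.09
row3,p.Cys124=,0.98
row4,p.Arg130Ter,0.05
row5,p.Ala79Thr,NA
"""
print(load_missense_scores(sample))  # {'C124S': 1.14, 'R130G': 1.09}
```

With the real download, pass the contents of `pten-abundance.csv` to `load_missense_scores` after confirming the column names.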
Step 2: Run ESM-2 scoring

For BRCA1: Score the RING domain (residues 1-101) and BRCT domain (residues 1631-1863) separately — each takes ~30 seconds. Request early access to NeuroAutomata to run this yourself.

For PTEN: Paste the full 403 amino acid sequence (UniProt P60484) and run a landscape scan.
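For the BRCA1 fragments, the residue ranges in this post are 1-based and inclusive, so BRCT residues 1631-1863 correspond to the Python slice `[1630:1863]`. A small illustrative helper, assuming you have the full-length BRCA1 sequence (UniProt P38398) as a string; the length check mirrors the 1,024 amino acid input limit mentioned above, and the placeholder sequence below only has the right length.

```python
PLATFORM_MAX_LEN = 1024  # input limit quoted in this post


def extract_fragment(full_seq, start, end):
    """Return residues start..end (1-based, inclusive) of full_seq."""
    if not 1 <= start <= end <= len(full_seq):
        raise ValueError(f"range {start}-{end} outside sequence of length {len(full_seq)}")
    fragment = full_seq[start - 1:end]
    if len(fragment) > PLATFORM_MAX_LEN:
        raise ValueError(f"fragment length {len(fragment)} exceeds {PLATFORM_MAX_LEN}")
    return fragment


# Placeholder sequence of BRCA1's length (1,863 aa); substitute the real P38398 sequence.
brca1_placeholder = "M" + "A" * 1862
ring = extract_fragment(brca1_placeholder, 1, 101)
brct = extract_fragment(brca1_placeholder, 1631, 1863)
print(len(ring), len(brct))  # 101 233
```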

Step 3: Compute correlation

Calculate the Spearman rank correlation between ESM-2 scores and experimental scores for all matched missense variants. You should get:

  • BRCA1 BRCT: rho ≈ 0.534 (±0.01)
  • BRCA1 RING: rho ≈ 0.409 (±0.01)
  • PTEN: rho ≈ 0.484 (±0.01)
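Step 3 needs nothing beyond a rank correlation over matched variants. Spearman's rho is the Pearson correlation of the two rank vectors, with tied values given their average rank; the sketch below first joins the two score tables on a shared variant key such as "C124S", so only matched missense variants enter. `scipy.stats.spearmanr` performs the same rank correlation and also returns a p-value; the helper names here are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values assigned the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j (0-based) -> ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_matched(model_scores, assay_scores):
    """Spearman rho over variants present in both dicts (variant key -> score)."""
    shared = sorted(model_scores.keys() & assay_scores.keys())
    if len(shared) < 3:
        raise ValueError("too few matched variants")
    x = average_ranks([model_scores[v] for v in shared])
    y = average_ranks([assay_scores[v] for v in shared])
    return pearson(x, y), len(shared)


# Made-up scores; A9Z has no model score and is dropped by the join.
model = {"A1B": -1.0, "A2C": -5.0, "A3D": -3.0, "A4E": -2.0}
assay = {"A1B": 1.0, "A2C": -2.0, "A3D": -1.5, "A4E": 0.2, "A9Z": 0.0}
rho, n = spearman_matched(model, assay)
print(rho, n)  # 1.0 4
```

On sign convention: masked-marginal scores, SGE function scores, and VAMP-seq abundance scores are all lower for damaging variants, so a positive rho means agreement.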

Data sources:

| Resource | Link |
| --- | --- |
| BRCA1 SGE paper | Findlay et al. 2018, Nature (PMC6181777) |
| BRCA1 SGE data portal | sge.gs.washington.edu/BRCA1/ |
| PTEN VAMP-seq paper | Matreyek et al. 2018, Nature Genetics (PMC5980760) |
| MaveDB (PTEN) | urn:mavedb:00000013-a-1 |
| ProteinGym (BRCA1) | BRCA1_HUMAN_Findlay_2018 |
| UniProt (BRCA1) | P38398 |
| UniProt (PTEN) | P60484 |
| PDB (BRCA1 RING) | 1JM7 |
| PDB (BRCA1 BRCT) | 1T29 |
| ESM-2 model | facebook/esm2_t33_650M_UR50D |

Try It on Your Protein

This analysis was run on NeuroAutomata, a browser-based ESM-2 scoring platform. NeuroAutomata is currently in early access for protein engineers and researchers. Request an invite to score your own sequences: up to 1,024 amino acids, full mutation landscape in ~30 seconds, no installation required.

BRCA1 and PTEN are two of 25 proteins we've systematically benchmarked. We're publishing analyses across pharmacogenomics, cancer VUS classification, and enzyme engineering, each with the same audit methodology shown above.

Research Use Only: a regulatory designation meaning the tool provides research scores, not clinical diagnoses. It is the same designation used by REVEL, CADD, AlphaMissense, and PolyPhen-2. ESM-2 scores provide computational evidence for the ACMG PP3/BP4 criteria, which allow computational variant effect predictions to count as pathogenic (PP3) or benign (BP4) evidence (Richards et al. 2015): supporting evidence in variant classification workflows, not standalone diagnostic calls. Clinical laboratories validate and incorporate computational scores under their own LDT (laboratory-developed test) workflows.


TL;DR

ESM-2 650M scores 5,949 BRCA1 and PTEN variants in under 2 minutes. Against the gold-standard assays: global rho = 0.534 (BRCA1 BRCT), 0.409 (BRCA1 RING), 0.484 (PTEN). The non-active-site BRCT result (rho = 0.772, N=59) is the second-strongest domain-level result in our 25-protein portfolio. For PTEN, ESM-2 and VAMP-seq catch different failure modes — combining them covers more clinical ground than either alone. This is ranking evidence for ACMG PP3/BP4, not a standalone classifier. Research Use Only.


What’s Next

This is Part 3 of the ESM-2 Benchmark Series. Previous posts:

Upcoming:

  • TPMT + NUDT15 — the thiopurine pathway story (two enzymes, near-identical ESM-2 performance)
  • Where protein language models fail: 14 boundary conditions (documented contexts where ESM-2's prediction accuracy transitions from reliable to unreliable) from 25 analyses
  • The rho ≈ 0.46 ceiling — what three independent datasets reveal about ESM-2’s fundamental limit

Each post follows the same methodology: cross-reference against published experimental data, independently audit every claim, disclose limitations, and provide everything you need to reproduce the results.


Analysis by [Research Agent], independently audited by [Validation Agent], directed by Jonathan Agoot, Axon Agentic. All verification data available on request.

BRCA1 saturation genome editing data from Findlay et al. 2018, used with attribution via ProteinGym. PTEN VAMP-seq data from Matreyek et al. 2018, used with attribution via MaveDB. ESM-2 model by Meta AI (Lin et al. 2023).