Why We Built NeuroAutomata

About This Post

Hello everyone,

Before the team’s writeup on NeuroAutomata, a note on who wrote what and how this post was produced.

NeuroAutomata is built on ESM-2 (Evolutionary Scale Modeling 2), a protein language model developed by Meta AI. I built the tool with agentic AI tools, similar to the approach I used for “Building an AI Multi-Agent System to Enable Natural Language Queries for Human Protein Atlas (HPA) data”. The HPA project is also still under active development.

In my past roles at life sciences and biotech companies, I kept running into the same pattern: my scientist co-workers had web-based tools that needed updates, and waiting for internal IT meant launch dates slipped and marketing goals broke. Because I also code, I’d help update their apps and run the marketing alongside. That pattern is how NeuroAutomata started — a tool I’d want to hand to those same co-workers.

I’m the only human (solopreneur) building this at the time of writing, so I rely on a set of AI agents for specific roles (meet the full AI staff):

Amara — marketing director (strategy, content, outreach)
Astro — website and frontend engineering
Kiran — product manager for NeuroAutomata (product direction, software engineering, scientific research)
Veritas — independent claim verification (fact-checking against primary sources)
Folio — content authoring (blog posts, methodology and landing pages, shared glossary)
Scout — source resolution (finding open-access versions of cited papers, confirming links stay live)

Regarding AI-generated content: I adapted what I learned from the HPA project and built a verification system to reduce AI-generated slop. My take is that this is early days: LLMs have come a long way, but they are far from perfect, and I take responsibility for what I publish. Every serious blog post is run through Veritas, our independent verification agent. Veritas extracts claims, checks them against primary sources across numerical, clinical, and semantic layers, and blocks publication on contradictions. Claims that could not be verified from primary sources are flagged in the post or in the verification report rather than silently passed. The verification report for this post is linked here.

This disclosure approach is new as of April 2026. Earlier posts predate this policy; I am applying it going forward, and both the disclosure and the underlying policy will continue to evolve.

With that, here is the team’s analysis of why we built NeuroAutomata.

ESM-2 Ranks 43rd of 95 on the ProteinGym PG_v1.3 Leaderboard. It Also Requires a GPU to Run.

You have 50 variants of uncertain significance (VUS) in your queue. A patient review is scheduled. Your current tool is PolyPhen-2.

PolyPhen-2 was published in 2010. Protein language models that outperform it by a documented margin have existed since 2021. The problem isn’t that better tools don’t exist — it’s that running them requires a Python environment, GPU access, 2.5 GB of model weight downloads, and custom code for tokenization and output formatting. That’s not a workflow for a pharmacogenomics lab under clinical time pressure. It’s a side project.

Modern protein language models are benchmarked in the ProteinGym substitution benchmark (benchmark design: Notin et al. 2023, NeurIPS) across 217 protein assays. ESM-2 ranks 43rd of 95 models on the ProteinGym PG_v1.3 leaderboard. SIFT and PolyPhen-2 were not evaluated in ProteinGym, so the leaderboard itself does not provide a head-to-head comparison with them.

NOTE

NeuroAutomata scores are Research Use Only (RUO) — the same designation carried by REVEL, CADD, AlphaMissense, and PolyPhen-2. Scores are computational evidence for research workflows, not clinical diagnoses.

The old tools dominate not because they’re better. They dominate because they’re a URL and a text box. NeuroAutomata closes that gap.

Tool	Paste-and-click?	Setup required	Accuracy tier	Price
SIFT / PolyPhen-2 (2003–2010)	Yes	None	Lower (not evaluated in ProteinGym)	Free
AlphaMissense	No	API / local install	Higher	Free
ESM-2 / Tranception (ESM-2 rank 43/95 on ProteinGym PG_v1.3)	No	Python + GPU + custom code	Higher (ESM-2: rank 43 of 95)	Free
Schrödinger / NVIDIA BioNeMo	No	Managed onboarding	Higher	Enterprise pricing
NeuroAutomata	Yes	None	ESM-2 engine (rank 43/95; 0.515 median on 5-protein validation)	Free (early access)

Paste a sequence, get masked marginal scores and 3D structure in ~30 seconds, on any laptop. See how it works.

Does It Actually Work?

Before we built the interface, we needed to know the scoring engine was reliable. We validated against two publicly available experimental datasets — and we’re disclosing both the results and the failure.

How We Tested This

Model: ESM-2 650M (Meta AI)
Dataset: ProteinGym v1.3 (Notin et al. 2023, NeurIPS) — 5-protein subset
Metric: Spearman rank correlation (ρ) vs. experimental fitness values
Version: NeuroAutomata scoring engine v1.0

Masked marginal scoring on each protein sequence. Predicted scores correlated against ground-truth fitness measurements from deep mutational scanning experiments. Category-level medians used as baselines are computed from the ProteinGym PG_v1.3 leaderboard CSV (cross-model medians per assay-category). All assays sourced from the ProteinGym v1.3 archive with published experimental protocols.
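To make the scoring step concrete, here is a minimal sketch of masked marginal scoring as the term is commonly used for protein language models: mask one position, read the model's probability for each amino acid there, and score a substitution as the log-probability of the mutant residue minus that of the wild-type residue. The function name and the toy probabilities are illustrative assumptions, not NeuroAutomata's code or real ESM-2 output.

```python
import math

def masked_marginal_score(log_probs, wild_type, mutant):
    """Score one substitution at a masked position.

    log_probs: dict of amino-acid letter -> log-probability assigned by the
    model when this position is masked (hypothetical input; a real run would
    derive these from the model's output logits).
    Returns log p(mutant) - log p(wild_type); more negative means the model
    finds the mutant less plausible than the wild type.
    """
    return log_probs[mutant] - log_probs[wild_type]

# Toy distribution for a single masked position (made-up numbers).
probs = {"L": 0.70, "I": 0.20, "V": 0.08, "P": 0.02}
log_probs = {aa: math.log(p) for aa, p in probs.items()}

score_conservative = masked_marginal_score(log_probs, "L", "I")
score_disruptive = masked_marginal_score(log_probs, "L", "P")
print(f"L->I: {score_conservative:.3f}")
print(f"L->P: {score_disruptive:.3f}")
```

In this toy case the substitution to proline scores far lower than the substitution to isoleucine, which is the kind of ordering the correlation step then compares against experimental fitness.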

We first confirmed the engine on GB1 (protein G B1 domain), a field-standard benchmark with experimental measurements for 1,045 single mutants (Olson et al. 2014^[1]). GB1 is a known hard case for zero-shot models due to its small size and epistatic fitness landscape, so our internal validation result of rho = 0.276 (p < 10⁻¹⁹) confirmed the signal is real before we expanded the validation. With the engine confirmed, we ran across five proteins from the ProteinGym benchmark — each with thousands of experimentally measured variants, covering the functional categories most relevant to clinical variant classification.
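The rho values in this post are Spearman rank correlations between predicted scores and measured fitness. A minimal, dependency-free sketch of that metric follows, assuming tied values get averaged ranks. The example numbers are made up for illustration and are not drawn from GB1 or ProteinGym.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(predicted, measured):
    """Spearman rho is the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(measured))

# Hypothetical model scores and measured fitness for five variants.
predicted = [-3.6, -1.3, -0.2, -2.9, -0.8]
measured = [0.05, 0.60, 0.95, 0.70, 0.40]
rho = spearman_rho(predicted, measured)
print(round(rho, 3))  # 0.6
```

Because only ranks matter, the metric rewards getting the ordering of variants right, even when the score scale and the fitness scale are unrelated.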

Protein	ProteinGym assay	What’s measured	Our rho	ESM-2 category median*
Beta-lactamase (BLAT)	`BLAT_ECOLX_Stiffler_2015`	Enzymatic activity	0.731	0.417
BRCA1	`BRCA1_HUMAN_Findlay_2018`	Activity (SGE, 95.9% (162/169) ClinVar concordance)	0.515	0.417
UBC9	`UBC9_HUMAN_Weile_2017`	Expression	0.473	0.417
PTEN	`PTEN_HUMAN_Mighell_2018`	Organismal fitness	0.519	0.384
Calmodulin (CALM1)	`CALM1_HUMAN_Weile_2017`	Binding	0.212	0.326

*Cross-model category medians from the ProteinGym PG_v1.3 leaderboard CSV. Individual assay scores vary within each category. All assays browsable at proteingym.org/substitutions.

Median rho = 0.515 across 5 proteins. [CLM-2026-0084: Verified]
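The headline figure can be reproduced from the table above. This short sketch recomputes the median and lists which assays fall below their category median. The dictionary layout is our own choice for illustration; the numbers are copied from the table.

```python
from statistics import median

# protein -> (our rho, cross-model category median), copied from the table above
results = {
    "BLAT": (0.731, 0.417),
    "BRCA1": (0.515, 0.417),
    "UBC9": (0.473, 0.417),
    "PTEN": (0.519, 0.384),
    "CALM1": (0.212, 0.326),
}

median_rho = median(rho for rho, _ in results.values())
below_baseline = [name for name, (rho, base) in results.items() if rho < base]

print(median_rho)      # 0.515
print(below_baseline)  # ['CALM1']
```

With five values the median is simply the middle one after sorting, so the single low calmodulin result does not pull the summary down the way it would pull a mean.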

NOTE

On the benchmark numbers: This 5-protein run (median rho = 0.515) is our initial engine validation — confirming the implementation works before building the product around it. The published ESM-2 650M result on the full ProteinGym suite (217 substitution DMS assays, ProteinGym v1.3 archive; benchmark design Notin et al. 2023 NeurIPS) is rho = 0.414 aggregate (mean) — rank 43 of 95 models on the ProteinGym PG_v1.3 leaderboard CSV — and is the conservative figure for broad comparisons. The ESM-2 Benchmark Series extends this validation to a curated 20-assay ProteinGym subset. These are sequential milestones — initial validation (5 proteins), published baseline (217 assays), and ongoing benchmark series (20-assay subset) — not competing claims.

What This Means for Your VUS Queue

The 5-protein validation answers the foundational question: does the scoring engine produce results consistent with experimental functional data across diverse protein families? The answer is yes, with documented failure modes you should know before relying on it.

Where it works reliably for clinical variant classification:

ESM-2 learns from evolutionary conservation across millions of protein sequences. This translates directly to structured enzymes and domains — the proteins that make up the core of clinical pharmacogenomics and cancer variant classification. BRCA1 (rho = 0.515 in the validation above, validated against SGE from Findlay et al. 2018^[2] — 95.9% (162 of 169) concordance with ClinVar pathogenic, 90.9% (20 of 22) with ClinVar benign) and PTEN (rho = 0.519) are representative of what you’d expect for structured proteins in your queue.
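The concordance percentages quoted for the BRCA1 SGE data are plain ratios of the counts given above. As a quick arithmetic check (the helper name is ours, chosen for illustration):

```python
def concordance_percent(agree, total):
    """Percentage of variants where the assay call agrees with the ClinVar label."""
    return round(100 * agree / total, 1)

pathogenic = concordance_percent(162, 169)
benign = concordance_percent(20, 22)
print(pathogenic)  # 95.9
print(benign)      # 90.9
```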

For CYP2C9, the first gene in the ESM-2 Benchmark Series, rho = 0.679 overall (preliminary; these CYP2C9 figures draw on a benchmark subset pending independent substrate verification — see Part 2) — with rho = 0.811 in the heme-binding domain where most known pharmacogenomic alleles sit. That domain-level signal is strong enough to contribute as ACMG PP3/BP4 computational evidence for variants without prior functional data.

CYP2C9*3 (I359L) caveat: *3 is the most clinically significant warfarin allele, responsible for an approximately 80% reduction in CYP2C9 activity. ESM-2 flags it directionally (negative score), but it falls in the uncertain zone. *3 is located in SRS5, the substrate recognition site where evolutionary conservation is weakest. Domain performance maps, not fixed thresholds, are how you use this signal. See Part 2: CYP2C9 benchmark for the full allele-by-allele analysis.

Two clinical failure modes to know:

Activity-without-abundance variants (a subset of pharmacogenomic deficiency alleles, including characterized examples across TPMT, NUDT15, CYP2C9, and CYP2C19): ESM-2 scores stability and conservation — not catalytic activity directly. Variants like TPMT*5 (L49S) and CYP2C19*6 (R132Q) retain wild-type protein abundance but have no function. VAMP-seq catches these; ESM-2 does not. For thiopurine and clopidogrel dosing variants specifically, treat ESM-2 scores as one input alongside expression data, not as a standalone classifier.

Intrinsically disordered regions: TP53’s N-terminal transactivation domain (residues 1–97) and regulatory region (357–393) are not evolutionarily conserved in sequence — conservation of function doesn’t require conservation of sequence in disordered regions. ESM-2 rho approaches zero or negative in those regions. For TP53 variants in ordered domains (DNA-binding, tetramerization), the signal returns. Check the domain annotation before interpreting scores on any partially disordered protein — disordered sequence tends to carry the thin evolutionary signal where ESM-2 is least reliable, the same applicability-domain limit we benchmark across proteins.
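Checking the domain annotation can be as simple as a position lookup before you read a score. The sketch below uses the two TP53 disordered ranges given above; the function names are hypothetical, and the two example variants are used only for their residue positions.

```python
import re

# Disordered TP53 regions as stated in the text (1-based, inclusive).
TP53_DISORDERED = [(1, 97), (357, 393)]

def parse_substitution(variant):
    """Split a substitution such as 'R175H' into (wild_type, position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", variant)
    if match is None:
        raise ValueError(f"not a single-residue substitution: {variant!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant

def in_disordered_region(variant, regions=TP53_DISORDERED):
    """True if the variant's position falls inside any listed disordered range."""
    _, position, _ = parse_substitution(variant)
    return any(start <= position <= end for start, end in regions)

print(in_disordered_region("P72R"))   # True: position 72 is within 1-97
print(in_disordered_region("R175H"))  # False: position 175 is outside both ranges
```

A score for a variant that returns True here deserves extra caution, since that is where the text says the correlation drops toward zero.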

ESM-2 scores are best used for ranking variants by predicted impact, not for binary damaging/benign classification. Fixed score thresholds do not generalize across protein families — precision varies from 0.62 (CYP2C9 activity) to 0.16 (NUDT15) at the same cutoff. The Benchmark Series provides protein-specific guidance instead.
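To illustrate why one cutoff does not transfer between proteins, the sketch below computes precision at the same threshold for two invented score sets. It assumes more negative scores mean higher predicted impact. The cutoff, scores, and labels are all hypothetical and are not the CYP2C9 or NUDT15 data.

```python
def precision_at_cutoff(scores, is_damaging, cutoff):
    """Fraction of variants scoring at or below `cutoff` that are truly damaging.

    Returns None when no variant is flagged, since precision is undefined then.
    """
    flagged = [label for score, label in zip(scores, is_damaging) if score <= cutoff]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)

cutoff = -2.0  # hypothetical fixed threshold

# (scores, experimental labels) for two made-up proteins
protein_a = ([-3.1, -2.5, -2.2, -1.0, -0.4], [True, True, False, False, False])
protein_b = ([-3.0, -2.8, -2.4, -2.1, -0.5], [True, False, False, False, False])

precision_a = precision_at_cutoff(*protein_a, cutoff)
precision_b = precision_at_cutoff(*protein_b, cutoff)
print(round(precision_a, 2))  # 0.67
print(round(precision_b, 2))  # 0.25
```

In both toy proteins the lowest-scoring variant is the damaging one, so the ranking is useful in each case, yet the same cutoff flags very different proportions of false positives.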

The ESM-2 Benchmark Series applies this protein-by-protein analysis to pharmacogenes and cancer VUS genes — domain-level performance maps, known failure positions, and guidance on where scores can contribute to ACMG criteria.

Part 1: Scoring 6,142 CYP2C9 Variants in 30 Seconds — Spearman rho = 0.679 overall (preliminary), with a ~2× performance gap between the heme-binding domain (rho = 0.811) and the substrate recognition site (SRS5, rho = 0.422).

Try It

NeuroAutomata runs ESM-2 650M masked marginal scoring on any protein sequence you paste — full mutation landscape scan in ~30 seconds. Early access is currently invite-only while we validate with our first cohort of researchers. Request early access.

If you’re working on a pharmacogene or cancer VUS gene with published DMS data and want to see a domain-level benchmark analysis in the series, get in touch.

TL;DR

ESM-2 650M ranks 43rd of 95 on the ProteinGym PG_v1.3 leaderboard (benchmark design Notin et al. 2023 NeurIPS) — a well-documented, peer-reviewed baseline.
NeuroAutomata wraps ESM-2 in a browser interface: paste a sequence, get ranked variant predictions in ~30 seconds, no Python or GPU required.
Internal validation across 5 proteins: median Spearman rho = 0.515. Calmodulin fails (rho = 0.212) — disclosed, not hidden.
Use it for ranking, not binary classification. Fixed thresholds don’t generalize. Protein-specific domain maps do.
CYP2C9 heme-binding domain: rho = 0.811 (preliminary). CYP2C9 SRS5 substrate site: rho = 0.422 (preliminary). Know which region you’re in.
Research Use Only — same designation as REVEL, CADD, AlphaMissense, PolyPhen-2. Not a clinical diagnostic.

Data

Validation results referenced in this article. Full dataset and CSV download available at /research/validation.

Protein	Assay (measurement; ProteinGym ID; source)	Dataset	ρ	ESM-2 baseline	Notes
Beta-lactamase (BLAT)	Enzymatic activity; `BLAT_ECOLX_Stiffler_2015`; Stiffler et al. 2015	5-Protein Benchmark	0.731	0.417	
BRCA1	SGE activity; `BRCA1_HUMAN_Findlay_2018`; Findlay et al. 2018	5-Protein Benchmark	0.515	0.417	
Calmodulin (CALM1)	Binding; `CALM1_HUMAN_Weile_2017`; Weile et al. 2017	5-Protein Benchmark	0.212	0.414	Well below ESM-2 650M's own published aggregate; confirms protein-protein binding as a known weak category
PTEN	Organismal fitness; `PTEN_HUMAN_Mighell_2018`; Mighell et al. 2018	5-Protein Benchmark	0.519	0.384	
UBC9	Expression; `UBC9_HUMAN_Weile_2017`; Weile et al. 2017	5-Protein Benchmark	0.473	0.417	
Protein G B1 domain (GB1)	Fitness (single mutants); Olson et al. 2014; n = 1,045	Pre-validation	0.276	n/a	Pre-benchmark signal check. p < 10⁻¹⁹. Not included in 5-protein median.
5-protein benchmark (aggregate)	Median across all 5 assays; internal validation	Aggregate	0.515	n/a	Median across the 5-protein internal validation subset

ρ values in amber are below the ESM-2 category baseline. Baselines: cross-model category medians from the ProteinGym leaderboard CSV (accessed 2026-05-08).