Why We Built NeuroAutomata
About This Post
Hello everyone,
Before the team’s writeup on NeuroAutomata, a note on who wrote what and how this post was produced.
NeuroAutomata is built on ESM-2 (Evolutionary Scale Modeling 2), a protein language model developed by Meta AI. I built the tool with agentic AI tools, similar to the approach I used for “Building an AI Multi-Agent System to Enable Natural Language Queries for Human Protein Atlas (HPA) data”. The HPA project is still under active development.
In my past roles at life sciences and biotech companies, I kept running into the same pattern: my scientist co-workers had web-based tools that needed updates, and waiting for internal IT meant launch dates slipped and marketing goals were missed. Because I also code, I’d help update their apps and run the marketing alongside. That pattern is how NeuroAutomata started: as a tool I’d want to hand to those same co-workers.
I’m the only human (solopreneur) building this at the time of writing, so I rely on a set of AI agents for specific roles (meet the full AI staff):
- Amara — marketing director (strategy, content, outreach)
- Astro — website and frontend engineering
- Kiran — product manager for NeuroAutomata (product direction, software engineering, scientific research)
- Veritas — independent claim verification (fact-checking against primary sources)
- Folio — content authoring (blog posts, methodology and landing pages, shared glossary)
- Scout — source resolution (finding open-access versions of cited papers, confirming links stay live)
Regarding AI-generated content: I adapted what I learned from the HPA project and built a verification system to reduce AI-generated slop. My take is that this is early days. LLMs have come a long way, but they are far from perfect. I take responsibility for what I publish. Every serious blog post is run through Veritas, our independent verification agent. Veritas extracts claims, checks them against primary sources across numerical, clinical, and semantic layers, and blocks publication on contradictions. Claims that could not be verified from primary sources are flagged in the post or in the verification report rather than silently passed. The verification report for this post is linked here.
This disclosure approach is new as of April 2026. Earlier posts predate this policy; I am applying it going forward, and both the disclosure and the underlying policy will continue to evolve.
With that, here is the team’s analysis of why we built NeuroAutomata.
ESM-2 Ranks 45th of 97 on the Live ProteinGym Leaderboard. It Also Requires a GPU to Run.
You have 50 VUS (variants of uncertain significance) in your queue. A patient review is scheduled. Your current tool is PolyPhen-2.
PolyPhen-2 was published in 2010. Protein language models that outperform it by a documented margin have existed since 2021. The problem isn’t that better tools don’t exist; it’s that running them requires a Python environment, GPU access, 2.5 GB of model weight downloads, and custom code for tokenization and output formatting. That’s not a workflow for a pharmacogenomics lab under clinical time pressure. It’s a side project.
The performance gap between legacy tools and modern protein language models is documented in the ProteinGym substitution benchmark (benchmark design Notin et al. 2023, NeurIPS) across 217 protein assays. ESM-2 ranks 45th of 97 models on the live leaderboard. SIFT and PolyPhen-2 were not evaluated in the ProteinGym benchmark.
NOTE
NeuroAutomata scores are Research Use Only (RUO), the same designation carried by REVEL, CADD, AlphaMissense, and PolyPhen-2. Scores are computational evidence for research workflows, not clinical diagnoses.
The old tools dominate not because they’re better. They dominate because they’re a URL and a text box. NeuroAutomata closes that gap.
| Tool | Paste-and-click? | Setup required | Accuracy tier | Price |
|---|---|---|---|---|
| SIFT / PolyPhen-2 (2003–2010) | Yes | None | Lower (not evaluated in ProteinGym) | Free |
| AlphaMissense | No | API / local install | Higher | Free |
| ESM-2 / Tranception (rank 45–46/97 on live ProteinGym leaderboard) | No | Python + GPU + custom code | Higher (ESM-2: rank 45 of 97) | Free |
| Schrödinger / NVIDIA BioNeMo | No | Managed onboarding | Higher | Enterprise pricing |
| NeuroAutomata | Yes | None | ESM-2 engine (rank 45/97, rho 0.414 published baseline (leaderboard CSV, 2026-05-08); 0.515 median on 5-protein validation) | Free (early access) |
Paste a sequence and get masked marginal scores (ESM-2's zero-shot measure of how surprising a mutant amino acid is relative to wild-type) and a 3D structure in ~30 seconds, on any laptop. See how it works.
Does It Actually Work?
Before we built the interface, we needed to know the scoring engine was reliable. We validated against two publicly available experimental datasets — and we’re disclosing both the results and the failure.
How We Tested This
- Model: ESM-2 650M (Meta AI)
- Dataset: ProteinGym v1.0 (Notin et al. 2023, NeurIPS), 5-protein subset
- Metric: Spearman rank correlation (ρ) vs. experimental fitness values
- Version: NeuroAutomata scoring engine v1.0
We ran masked marginal scoring on each protein sequence and correlated the predicted scores against ground-truth fitness measurements from deep mutational scanning experiments. The category-level medians used as baselines are computed from the live ProteinGym leaderboard CSV (cross-model medians per assay-category, accessed 2026-05-08). All assays are sourced from the ProteinGym v1.3 archive with published experimental protocols.
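To make the scoring step concrete, here is a minimal sketch of a masked marginal score for a single substitution: the log-probability of the mutant residue minus the log-probability of the wild-type residue at a masked position. It assumes you already have the model's logits for that position; the `toy_logits` values are hypothetical and are not real ESM-2 output, and the function names are illustrative rather than NeuroAutomata's actual code.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_softmax(logits):
    """Convert raw logits for one position into log-probabilities."""
    peak = max(logits.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return {aa: v - log_norm for aa, v in logits.items()}

def masked_marginal_score(masked_logits, wild_type, mutant):
    """Return log p(mutant) - log p(wild type) at a masked position.

    Negative values mean the model finds the mutant less plausible
    than the wild-type residue in this sequence context.
    """
    log_probs = log_softmax(masked_logits)
    return log_probs[mutant] - log_probs[wild_type]

# Hypothetical logits for one masked position (not real ESM-2 output).
toy_logits = {aa: 0.0 for aa in AMINO_ACIDS}
toy_logits["I"] = 3.0  # wild-type residue, strongly favored
toy_logits["L"] = 1.0  # conservative substitution, mildly favored

score_I_to_L = masked_marginal_score(toy_logits, "I", "L")
score_I_to_P = masked_marginal_score(toy_logits, "I", "P")
print(f"I->L: {score_I_to_L:+.2f}, I->P: {score_I_to_P:+.2f}")
```

In this toy case the conservative I-to-L change scores closer to zero than I-to-P, which is the ordering a ranking-based evaluation relies on.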
We first confirmed the engine on GB1 (protein G B1 domain), a field-standard benchmark with experimental measurements for 1,045 single mutants (Olson et al. 2014 [1], “Comprehensive biophysical characterization of the effects of all amino acid mutations in GB1”, doi:10.1016/j.cub.2014.09.072). GB1 is a known hard case for zero-shot models due to its small size and epistatic fitness landscape, so our internal validation result of rho = 0.276 (p < 10⁻¹⁹) confirmed the signal is real before we expanded the validation. With the engine confirmed, we ran it across five proteins from the ProteinGym benchmark, each with thousands of experimentally measured variants, covering the functional categories most relevant to clinical variant classification.
| Protein | ProteinGym assay | What’s measured | Our rho | ESM-2 category median* |
|---|---|---|---|---|
| Beta-lactamase (BLAT) | BLAT_ECOLX_Stiffler_2015 | Enzymatic activity | 0.731 | 0.420 |
| BRCA1 | BRCA1_HUMAN_Findlay_2018 | Activity (SGE, 95.9% ClinVar concordance) | 0.515 | 0.420 |
| UBC9 | UBC9_HUMAN_Weile_2017 | Expression | 0.473 | 0.418 |
| PTEN | PTEN_HUMAN_Mighell_2018 | Organismal fitness | 0.519 | 0.384 |
| Calmodulin (CALM1) | CALM1_HUMAN_Weile_2017 | Binding | 0.212 | 0.329 |
*Cross-model category medians from the live ProteinGym leaderboard CSV (accessed 2026-05-08). Individual assay scores vary within each category. All assays browsable at proteingym.org/substitutions.
Median rho = 0.515 across 5 proteins, 24% above the published ESM-2 650M baseline (0.414, leaderboard CSV, accessed 2026-05-08). Verified claims: CLM-2026-0084, CLM-2026-0085.
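The arithmetic behind that headline figure can be reproduced from the table. This is a small sketch using only the five rho values and the 0.414 baseline quoted in this post; the variable and function names are mine.

```python
from statistics import median

# Per-protein Spearman rho values from the validation table.
rho_by_protein = {
    "BLAT": 0.731,
    "BRCA1": 0.515,
    "UBC9": 0.473,
    "PTEN": 0.519,
    "CALM1": 0.212,
}
PUBLISHED_BASELINE = 0.414  # ESM-2 650M aggregate from the leaderboard CSV

def percent_above(value, baseline):
    """Relative gain of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline

median_rho = median(rho_by_protein.values())
gain = percent_above(median_rho, PUBLISHED_BASELINE)
print(f"median rho = {median_rho:.3f}, {gain:.1f}% above baseline")
# prints: median rho = 0.515, 24.4% above baseline
```

Because the median ignores how far the extremes sit from the middle, the calmodulin result (0.212) does not pull the headline number down, which is one reason to read the per-protein rows and not only the aggregate.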
NOTE
On the benchmark numbers: This 5-protein run (median rho = 0.515, ~24% above the published ESM-2 650M aggregate of 0.414) is our initial engine validation — confirming the implementation works before building the product around it. The published ESM-2 650M result on the full ProteinGym suite (217 substitution DMS assays, ProteinGym v1.3 archive; benchmark design Notin et al. 2023 NeurIPS) is rho = 0.414 aggregate (mean) — rank 45 of 97 models on the live leaderboard CSV (accessed 2026-05-08) — and is the conservative figure for broad comparisons. The ESM-2 Benchmark Series extends this validation to a curated 20-assay ProteinGym subset and achieves a median of 0.487 (~18% above the published ESM-2 650M aggregate of 0.414). These are sequential milestones — initial validation (5 proteins), published baseline (217 assays), and ongoing benchmark series (20-assay subset) — not competing claims.
What This Means for Your VUS Queue
The 5-protein validation answers the foundational question: does the scoring engine produce results consistent with experimental functional data across diverse protein families? The answer is yes, with documented failure modes you should know before relying on it.
Where it works reliably for clinical variant classification:
ESM-2 learns from evolutionary conservation across millions of protein sequences. This translates directly to structured enzymes and domains, the proteins that make up the core of clinical pharmacogenomics and cancer variant classification. BRCA1 (rho = 0.515 in the validation above, validated against saturation genome editing (SGE) data from Findlay et al. 2018 [2], “Accurate classification of BRCA1 variants with saturation genome editing”, doi:10.1038/s41586-018-0461-z; 95.9% concordance with ClinVar pathogenic, 90.9% with ClinVar benign) and PTEN (rho = 0.519) are representative of what you’d expect for structured proteins in your queue.
For CYP2C9, the first gene in the ESM-2 Benchmark Series, rho = 0.679 overall, with rho = 0.811 in the heme-binding domain where most known pharmacogenomic alleles sit. That domain-level signal is strong enough to contribute as ACMG PP3/BP4 computational evidence for variants without prior functional data.
CYP2C9*3 (I359L) caveat: *3 is the most clinically significant warfarin allele — responsible for approximately 80% reduction in CYP2C9 activity. ESM-2 flags it directionally (negative score) but it falls in the uncertain zone. *3 is located in SRS5, the substrate recognition site where evolutionary conservation is weakest. Domain performance maps, not fixed thresholds, are how you use this signal. See Part 2: CYP2C9 benchmark for the full allele-by-allele analysis.
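A domain performance map is, at its simplest, a Spearman correlation computed separately for each domain. The sketch below shows one way to do that in plain Python. The toy records, domain labels, and function names are hypothetical and exist only to illustrate the grouping; they are not CYP2C9 data and not our production pipeline.

```python
from collections import defaultdict

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based ranks in the tie
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either input is constant

def rho_by_domain(records):
    """records: iterable of (domain, predicted_score, measured_fitness)."""
    grouped = defaultdict(lambda: ([], []))
    for domain, predicted, measured in records:
        grouped[domain][0].append(predicted)
        grouped[domain][1].append(measured)
    return {d: spearman_rho(p, m) for d, (p, m) in grouped.items()}

# Hypothetical toy records; labels and numbers are illustrative only.
toy = [
    ("heme_binding", -8.1, 0.05), ("heme_binding", -5.0, 0.30),
    ("heme_binding", -2.2, 0.70), ("heme_binding", -0.4, 0.95),
    ("SRS5", -3.0, 0.60), ("SRS5", -2.5, 0.20),
    ("SRS5", -1.0, 0.90), ("SRS5", -0.2, 0.80),
]
domain_map = rho_by_domain(toy)
```

With a map like this in hand, a score is read against how well the model performs in that variant's own domain, not against one global cutoff.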
Two clinical failure modes to know:
Activity-without-abundance variants (a subset of pharmacogenomic deficiency alleles, including characterized examples across TPMT, NUDT15, CYP2C9, and CYP2C19): ESM-2 scores stability and conservation, not catalytic activity directly. Variants like TPMT*5 (L49S) and CYP2C19*6 (R132Q) retain wild-type protein abundance but have no function. VAMP-seq catches these; ESM-2 does not. For thiopurine and clopidogrel dosing variants specifically, treat ESM-2 scores as one input alongside expression data, not as a standalone classifier.
Intrinsically disordered regions: TP53’s N-terminal transactivation domain (residues 1–97) and regulatory region (357–393) are not evolutionarily conserved in sequence — conservation of function doesn’t require conservation of sequence in disordered regions. ESM-2 rho approaches zero or negative in those regions. For TP53 variants in ordered domains (DNA-binding, tetramerization), the signal returns. Check the domain annotation before interpreting scores on any partially disordered protein.
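The domain check can be as simple as a range lookup before a score is read. This is a minimal sketch that uses only the two TP53 residue ranges quoted above; the function names and the wording of the returned messages are my own, and a real workflow would pull region boundaries from a curated annotation source.

```python
# Disordered TP53 regions as given in the text (1-based, inclusive).
TP53_DISORDERED = {
    "N-terminal transactivation domain": (1, 97),
    "regulatory region": (357, 393),
}

def disordered_region(position, regions=TP53_DISORDERED):
    """Return the name of the disordered region containing position, or None."""
    for name, (start, end) in regions.items():
        if start <= position <= end:
            return name
    return None

def interpret(position, score):
    """Attach a caution to scores that fall inside a listed disordered region."""
    region = disordered_region(position)
    if region is not None:
        return f"residue {position}: score {score:+.2f} is low-confidence ({region})"
    return f"residue {position}: score {score:+.2f}, outside the listed disordered regions"

print(interpret(20, -1.5))
print(interpret(175, -6.2))
```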
ESM-2 scores are best used for ranking variants by predicted impact, not for binary damaging/benign classification. Fixed score thresholds do not generalize across protein families — precision varies from 0.62 (CYP2C9 activity) to 0.16 (NUDT15) at the same cutoff. The Benchmark Series provides protein-specific guidance instead.
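To see why one cutoff does not transfer, here is a small sketch that applies the same threshold to two proteins and measures precision on each. The cutoff, the scores, and the labels are all hypothetical toy values chosen for illustration; they are not the CYP2C9 or NUDT15 data behind the figures above.

```python
def precision_at_cutoff(scores, is_damaging, cutoff):
    """Precision of the rule 'score <= cutoff means damaging'."""
    flagged = [label for score, label in zip(scores, is_damaging) if score <= cutoff]
    if not flagged:
        return None  # nothing flagged, so precision is undefined
    return sum(flagged) / len(flagged)

CUTOFF = -4.0  # hypothetical threshold, not a NeuroAutomata recommendation

# Hypothetical toy data: identical scores, different experimental labels.
protein_a = ([-7.0, -6.0, -5.0, -4.5, -1.0], [True, True, True, False, False])
protein_b = ([-7.0, -6.0, -5.0, -4.5, -1.0], [False, True, False, False, True])

precision_a = precision_at_cutoff(*protein_a, CUTOFF)
precision_b = precision_at_cutoff(*protein_b, CUTOFF)
print(f"protein A: {precision_a:.2f}, protein B: {precision_b:.2f}")
# prints: protein A: 0.75, protein B: 0.25
```

The same four variants are flagged in both proteins, yet the share that are truly damaging differs threefold, which is the pattern that makes protein-specific guidance more useful than a universal threshold.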
The ESM-2 Benchmark Series applies this protein-by-protein analysis to pharmacogenes and cancer VUS genes: domain-level performance maps, known failure positions, and guidance on where scores can contribute to ACMG criteria.
Part 1: Scoring 6,142 CYP2C9 Variants in 30 Seconds. Spearman rho = 0.679 overall, with a ~2× performance gap between the heme-binding domain (rho = 0.811) and the substrate recognition site (SRS5, rho = 0.422).
Try It
NeuroAutomata runs ESM-2 650M masked marginal scoring on any protein sequence you paste, with a full mutation landscape scan in ~30 seconds. Early access is currently invite-only while we validate with our first cohort of researchers. Request early access.
If you’re working on a pharmacogene or cancer VUS gene with published DMS (deep mutational scanning) data and want to see a domain-level benchmark analysis in the series, get in touch.
TL;DR
- ESM-2 650M ranks 45th of 97 on the live ProteinGym leaderboard (ProteinGym v1.3 archive, accessed 2026-05-08; benchmark design Notin et al. 2023 NeurIPS) — a well-documented, peer-reviewed baseline.
- NeuroAutomata wraps ESM-2 in a browser interface: paste a sequence, get ranked variant predictions in ~30 seconds, no Python or GPU required.
- Internal validation across 5 proteins: median Spearman rho = 0.515. Calmodulin fails (rho = 0.212) — disclosed, not hidden.
- Use it for ranking, not binary classification. Fixed thresholds don’t generalize. Protein-specific domain maps do.
- CYP2C9 heme-binding domain: rho = 0.811. CYP2C9 SRS5 substrate site: rho = 0.422. Know which region you’re in.
- Research Use Only: same designation as REVEL, CADD, AlphaMissense, PolyPhen-2. Not a clinical diagnostic.
Data
Validation results referenced in this article. Full dataset and CSV download available at /research/validation.
| Protein | Assay | Dataset | ρ | ESM-2 Baseline | Notes |
|---|---|---|---|---|---|
| Beta-lactamase (BLAT) | Enzymatic activity | 5-Protein Benchmark | 0.731 | 0.420 | |
| BRCA1 | SGE activity | 5-Protein Benchmark | 0.515 | 0.420 | |
| UBC9 | Expression | 5-Protein Benchmark | 0.473 | 0.418 | |
| PTEN | Organismal fitness | 5-Protein Benchmark | 0.519 | 0.384 | |
| Calmodulin (CALM1) | Binding | 5-Protein Benchmark | 0.212 | 0.414 | Well below ESM-2 650M's own published aggregate — confirms protein-protein binding as a known weak category |
| Protein G B1 domain (GB1) | Fitness (single mutants) | Pre-validation | 0.276 | — | Pre-benchmark signal check. p < 10⁻¹⁹. Not included in 5-protein median. |
| 5-protein benchmark (aggregate) | Median across all 5 assays | Aggregate | 0.515 | — | 24% above the published ESM-2 650M aggregate of 0.414 (ProteinGym leaderboard CSV, accessed 2026-05-08) |
| ESM-2 Benchmark Series | Curated 20-assay ProteinGym subset | Aggregate | 0.487 | 0.414 | ~18% above the published ESM-2 650M aggregate of 0.414 |
ρ values in amber are below the ESM-2 category baseline. Baselines: cross-model category medians from the ProteinGym leaderboard CSV (accessed 2026-05-08).