Protein Language Model

Also known as: PLM, pLM, protein LM

A deep learning model trained on millions of protein sequences to predict masked residues from their context; the patterns it learns can then be used to predict how mutations affect function. NeuroAutomata uses ESM-2, a PLM developed by Meta AI.

Source: Hu et al. 'Protein Language Models: A Comprehensive Survey.' arXiv 2502.06881 (2025).

Protein language models (PLMs) are deep neural networks trained on large databases of protein sequences using self-supervised objectives — typically masked language modeling. By learning patterns across millions of evolutionarily related sequences, PLMs capture structural constraints, functional motifs, and mutational tolerance without explicit labels or experimental data.

How They Work

PLMs borrow the transformer architecture from natural language processing. Instead of words, the “vocabulary” is the 20 standard amino acids. Training involves masking random residues in a protein sequence and predicting them from context — the same approach used by models like BERT in NLP. This forces the model to learn which amino acids are plausible at each position given the surrounding sequence.
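
The masking step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the training code of any particular model: the 15% mask rate is an assumption borrowed from BERT-style setups, `MASK_TOKEN` and `mask_sequence` are hypothetical names, and the example sequence is arbitrary.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK_TOKEN = "<mask>"                 # hypothetical placeholder token


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Hide a random subset of residues and record what must be recovered.

    Returns (tokens, targets): tokens is the sequence with some residues
    replaced by MASK_TOKEN; targets maps each masked position to the
    original residue at that position.
    """
    unknown = set(sequence) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")

    rng = random.Random(seed)
    # Mask at least one position so short sequences still yield a target.
    n_masked = max(1, round(len(sequence) * mask_rate))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))

    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]   # the label the model is trained to predict
        tokens[pos] = MASK_TOKEN     # the model only sees the placeholder
    return tokens, targets


tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(" ".join(tokens))
print(targets)
```

During training, the model would receive `tokens` as input and be penalized for assigning low probability to the residues stored in `targets`.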

The resulting embeddings encode rich information about protein structure and function that generalizes across protein families.

Key Models

Model               Parameters   Developer         Year
ESM-2               8M–15B       Meta AI           2023
ESM1b               650M         Meta AI           2021
ProtTrans (ProtT5)  3B           TU Munich         2021
ProGen2             151M–6.4B    Salesforce        2023
AlphaMissense       -            Google DeepMind   2023

NeuroAutomata uses ESM-2 650M, developed by Meta AI, for zero-shot mutation effect prediction.
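
The text does not say how the mutation scores are computed. One common approach with masked language models is to mask the position of interest and compare the log-probability of the mutant residue with that of the wild-type residue. The sketch below assumes that approach; `mutation_score` is a hypothetical helper and the probabilities in `toy_probs` are invented for illustration, not real model output.

```python
import math


def mutation_score(position_probs, wild_type, mutant):
    """Log-odds of the mutant residue versus the wild type at one position.

    position_probs maps each residue to the probability a model assigns
    to it at the masked position. A negative score means the model finds
    the mutant less plausible than the wild type.
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])


# Invented probabilities for a single masked position (assumption, not model output).
toy_probs = {"L": 0.60, "I": 0.25, "V": 0.10, "P": 0.05}

for mutant in ("I", "V", "P"):
    score = mutation_score(toy_probs, "L", mutant)
    print(f"L->{mutant}: {score:+.2f}")
```

No labeled mutation data enters this calculation, which is what makes the prediction zero-shot: the score comes entirely from what the model learned during sequence pretraining.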

What PLMs Can and Cannot Do

Strengths: Zero-shot variant effect prediction, no experimental data required, fast inference across entire protein sequences, strong performance on stability-related mutations.

Limitations: Weaker on binding affinity and protein-protein interactions. Scores are relative rankings, not calibrated probabilities. Not a substitute for experimental validation — PLMs provide computational evidence that complements functional assays.
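
Because the scores are relative rather than calibrated, one cautious way to use them is to keep only their order. The sketch below ranks a set of variants and converts the ranks to percentiles; the variant names and score values are hypothetical, and `rank_variants` and `percentile_ranks` are illustrative helpers rather than part of any library.

```python
def rank_variants(scores):
    """Order variant names from highest score (most plausible) to lowest."""
    return sorted(scores, key=scores.get, reverse=True)


def percentile_ranks(scores):
    """Map each variant to its rank fraction in (0, 1]; 1.0 is the highest score.

    Ties are broken by input order because sorted() is stable.
    """
    ascending = sorted(scores, key=scores.get)
    n = len(ascending)
    return {name: (i + 1) / n for i, name in enumerate(ascending)}


# Hypothetical scores for three variants of the same protein.
toy_scores = {"L10I": -0.9, "L10V": -1.8, "L10P": -2.5}

print(rank_variants(toy_scores))
print(percentile_ranks(toy_scores))
```

A percentile here says only where a variant sits among the variants scored together; it is not a probability that the variant is damaging.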