Protein Language Model — Glossary

Protein language models (PLMs) are deep neural networks trained on large databases of protein sequences with self-supervised objectives, most commonly masked language modeling. By learning patterns across millions of natural sequences shaped by evolution, PLMs capture structural constraints, functional motifs, and mutational tolerance without explicit labels or experimental measurements.

How They Work

PLMs borrow the transformer architecture from natural language processing. Instead of words, the tokens are amino acid residues, so the “vocabulary” is essentially the 20 standard amino acids plus a handful of special tokens such as the mask. Training involves masking random residues in a protein sequence and predicting them from context, the same approach used by models like BERT in NLP. This forces the model to learn which amino acids are plausible at each position given the surrounding sequence.
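
The masking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the training code of any particular model: the function name `mask_sequence`, the `<mask>` token string, the 15% mask rate, and the example sequence are all assumptions chosen for the sketch, and real pipelines add details it omits (batching, tokenization, occasionally swapping in a random residue instead of the mask).

```python
import random

MASK = "<mask>"  # placeholder token; the exact string is an assumption


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Hide a random subset of residues and remember what was hidden.

    Returns (tokens, targets): tokens is the sequence with some residues
    replaced by MASK, and targets maps each masked position to the residue
    the model would be trained to predict there.
    """
    rng = random.Random(seed)
    # Mask at least one residue, but never more than the sequence holds.
    n_masked = min(len(sequence), max(1, round(len(sequence) * mask_rate)))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return tokens, targets


# Hypothetical 22-residue sequence, used only for illustration.
example = "MKTAYIAKQRQISFVKSHFSRQ"
tokens, targets = mask_sequence(example)
```

During training, the model sees `tokens` and is penalized when its predicted distribution at each masked position puts low probability on the residue stored in `targets`.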

The resulting embeddings encode rich information about protein structure and function that generalizes across protein families.

Key Models

Model	Parameters	Developer	Year
ESM-2	8M–15B	Meta AI	2023
ESM-1b	650M	Meta AI	2021
ProtTrans (ProtT5)	3B	TU Munich	2021
ProGen2	151M–6.4B	Salesforce	2023
AlphaMissense	—	Google DeepMind	2023

NeuroAutomata uses ESM-2 650M, developed by Meta AI, for zero-shot mutation effect prediction.
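
One common way to turn masked-residue predictions into a zero-shot mutation score is a log-likelihood ratio: mask the mutated position, read off the model's probabilities for the mutant and wild-type residues, and take the difference of their logs. The sketch below assumes that scoring scheme for illustration; the source does not state how NeuroAutomata computes its scores, and the probabilities here are made-up numbers rather than model output.

```python
import math


def log_likelihood_ratio(position_probs, wild_type, mutant):
    """Score a substitution as log p(mutant) - log p(wild type).

    position_probs maps amino acid -> probability at the masked position.
    A negative score means the model finds the mutant residue less
    plausible than the wild type in this sequence context.
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])


# Made-up probabilities for one masked position (illustrative only).
toy_probs = {"L": 0.60, "I": 0.20, "V": 0.15, "P": 0.05}

score_conservative = log_likelihood_ratio(toy_probs, "L", "I")  # Leu -> Ile
score_disruptive = log_likelihood_ratio(toy_probs, "L", "P")    # Leu -> Pro
```

With these toy numbers both substitutions score below zero, and the leucine-to-proline change scores lower than leucine-to-isoleucine, which is the ordering one would hope for from a substitution that typically disrupts structure.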

What PLMs Can and Cannot Do

Strengths: Zero-shot variant effect prediction, no experimental data required, fast inference across entire protein sequences, strong performance on stability-related mutations.

Limitations: Weaker at predicting effects on binding affinity and protein-protein interactions. Scores are relative rankings, not calibrated probabilities. PLMs are not a substitute for experimental validation; they provide computational evidence that complements functional assays.
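
Because scores are relative rather than calibrated, a reasonable way to use them is to order candidate variants and prioritize the extremes for follow-up, instead of reading any single score as a probability of damage. A minimal sketch, with hypothetical variant names and made-up scores:

```python
def rank_variants(scores):
    """Return variant names ordered from highest (most tolerated) score to lowest."""
    return sorted(scores, key=scores.get, reverse=True)


# Made-up scores for three substitutions at one position (illustrative only).
toy_scores = {"L10P": -2.5, "L10I": -1.1, "L10V": -1.4}

ranking = rank_variants(toy_scores)
```

Only the order carries meaning here: the gap between two scores says which variant the model prefers, not how likely either one is to be harmful.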