ProteinGym
Also known as: ProteinGym benchmark, ProteinGym substitution benchmark
A standardized benchmark suite for protein variant effect predictors, covering 217 deep mutational scanning assays across diverse protein families.
Source: Notin P et al. 'ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design.' NeurIPS 2023. https://doi.org/10.48550/arXiv.2305.15757
ProteinGym is a standardized benchmark for evaluating protein variant effect predictors. It compiles deep mutational scanning (DMS) datasets from 217 assays spanning diverse protein families and provides a consistent evaluation framework: every method is scored on the same held-out data, with the same correlation metrics and the same train/test split.
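To make the correlation metric concrete, here is a minimal sketch of Spearman's rho in pure Python, computed as the Pearson correlation of the two rank vectors with tied values sharing an average rank. It is an illustration only, not ProteinGym's own scoring code; the function names and the toy scores are hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(predicted, measured):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(measured)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5

# Toy values invented for illustration: model scores vs. assay measurements.
predicted = [-2.1, -0.3, -5.0, 0.4]
measured = [0.2, 0.9, 0.1, 0.7]
rho = spearman_rho(predicted, measured)  # 0.8 for these four toy points
```

Because only ranks matter, a predictor is not penalized for reporting scores on a different scale from the assay, only for ordering variants differently.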
What It Measures
ProteinGym evaluates predictors on:
- Substitution benchmark: single amino acid substitutions (217 assays, ~2.5 million variants)
- Indel benchmark: insertions and deletions (smaller dataset)
Published ESM-2 Performance
On the full 217-assay substitution benchmark, ESM-2 650M achieves a mean Spearman rho of 0.414 (rank 45 of 97 models on the live leaderboard, accessed 2026-05-08). This is the number to cite when comparing any ESM-2 implementation to the published model. Source: ProteinGym leaderboard CSV (OATML-Markslab/ProteinGym).
Our 20-Assay Benchmark
Our ESM-2 Benchmark Series uses a curated 20-assay subset chosen to cover five functional categories (activity, stability, binding, expression, organismal fitness). Our implementation achieves a median Spearman rho of 0.487 on this subset, about 18% higher in relative terms than the published ESM-2 650M aggregate of 0.414 ((0.487 - 0.414) / 0.414 ≈ 0.176).
Note: The +18% figure compares a median over the curated 20-assay subset with a mean over the full 217-assay ProteinGym suite, so it is not a like-for-like comparison.
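The relative difference quoted above can be checked with a few lines of arithmetic. The two input values come from this page; the variable names are illustrative only.

```python
published_full_mean = 0.414  # ESM-2 650M, mean rho, full 217-assay benchmark
subset_median = 0.487        # our implementation, median rho, 20-assay subset

# Relative difference of the subset figure over the published aggregate.
relative_gain = (subset_median - published_full_mean) / published_full_mean
print(f"{relative_gain:.1%}")  # prints 17.6%
```

The result rounds to the 18% quoted above. The absolute gap is 0.073 rho, which is often the more informative number to report alongside the percentage.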
Accessing Data
ProteinGym data is publicly available at proteingym.org and via the GitHub repository.
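Once an assay file has been downloaded, reading it might look like the sketch below. It assumes a CSV with a `mutant` column (substitutions written like `A2G`, with multiple substitutions joined by `:`) and a `DMS_score` column; verify these names against the files you actually download. The rows are synthetic and stand in for a real file.

```python
import csv
import io

# Synthetic stand-in for one assay file; these rows are invented for illustration.
TOY_ASSAY_CSV = """mutant,DMS_score
A2G,0.91
K5R,0.12
A2G:K5R,-0.40
"""

def parse_mutant(mutant):
    """Split 'A2G:K5R' into [('A', 2, 'G'), ('K', 5, 'R')]."""
    subs = []
    for token in mutant.split(":"):
        # First char is the wild-type residue, last is the variant residue,
        # and everything in between is the 1-based position.
        wild_type, position, variant = token[0], int(token[1:-1]), token[-1]
        subs.append((wild_type, position, variant))
    return subs

def load_assay(handle):
    """Return a list of (substitutions, score) pairs from an open CSV handle."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append((parse_mutant(row["mutant"]), float(row["DMS_score"])))
    return rows

records = load_assay(io.StringIO(TOY_ASSAY_CSV))
# Keep only variants that carry exactly one substitution.
singles = [r for r in records if len(r[0]) == 1]
```

For a real file, replace the `io.StringIO` wrapper with `open(path, newline="")`, where `path` points to the downloaded CSV.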