ProteinGym

Also known as: ProteinGym benchmark, ProteinGym substitution benchmark

A standardized benchmark suite for protein variant effect predictors, covering 217 deep mutational scanning assays across diverse protein families.

Source: Notin P et al. 'ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design.' NeurIPS 2023. https://doi.org/10.48550/arXiv.2305.15757

ProteinGym is a standardized benchmark for evaluating protein variant effect predictors. It compiles deep mutational scanning (DMS) datasets from 217 assays spanning diverse protein families and provides a consistent evaluation framework: the same held-out data, the same correlation metrics, and the same train/test split for all methods.

What It Measures

ProteinGym evaluates predictors on:

  • Substitution benchmark: amino acid substitutions, single and multiple (217 assays, ~2.5 million variants)
  • Indel benchmark: insertions and deletions (smaller dataset)

Published ESM-2 Performance

On the full 217-assay substitution benchmark, ESM-2 650M achieves a mean Spearman rho of 0.414 (rank 45 of 97 models on the live leaderboard, accessed 2026-05-08). This is the number to cite when comparing any ESM-2 implementation to the published model. Source: ProteinGym leaderboard CSV (OATML-Markslab/ProteinGym).
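As an illustration of how a mean Spearman rho across assays can be computed, here is a minimal sketch in plain Python. The assay names and scores are hypothetical toy data, and the plain unweighted mean is an assumption; the official leaderboard aggregation may group or weight assays differently.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if an input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: assay name -> (model scores, measured DMS scores).
toy_assays = {
    "assay_A": ([0.1, 0.4, 0.3, 0.9], [1.0, 2.0, 3.0, 4.0]),
    "assay_B": ([0.5, 0.2, 0.8, 0.1], [0.5, 0.2, 0.8, 0.1]),
}

per_assay = {name: spearman_rho(p, m) for name, (p, m) in toy_assays.items()}
mean_rho = mean(per_assay.values())  # unweighted mean across assays (assumption)
```

Because Spearman rho depends only on rank order, a predictor's raw scores do not need to be on the same scale as the assay measurements.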

Our 20-Assay Benchmark

Our ESM-2 Benchmark Series uses a curated 20-assay subset selected to cover five functional categories (activity, stability, binding, expression, organismal fitness). Our implementation achieves a median Spearman rho of 0.487 on this subset, approximately 18% above the published ESM-2 650M aggregate of 0.414. The two statistics also differ in kind: ours is a median, while the published figure is a mean.

Note: Our +18% is on the curated 20-assay subset, not the full 217-assay ProteinGym suite.
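The relative difference quoted above follows from the two figures in the text. A short sketch of the arithmetic (variable names are illustrative only):

```python
published_mean_rho = 0.414  # ESM-2 650M, full 217-assay substitution benchmark
subset_median_rho = 0.487   # our implementation, curated 20-assay subset

absolute_gain = subset_median_rho - published_mean_rho
relative_gain = absolute_gain / published_mean_rho

print(f"absolute: {absolute_gain:.3f}, relative: {relative_gain:.1%}")
# prints "absolute: 0.073, relative: 17.6%"
```

The 17.6% result rounds to the approximately 18% stated above.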

Accessing Data

ProteinGym data is publicly available at proteingym.org and via the OATML-Markslab/ProteinGym GitHub repository.
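Once an assay file has been downloaded, reading it might look like the sketch below. The inline sample stands in for a real file, and the column names (`mutant`, `DMS_score`) and the colon-separated multi-mutant notation are assumptions that should be checked against the actual CSVs.

```python
import csv
import io

# Hypothetical stand-in for one downloaded assay file; the column names
# are assumptions to verify against the real data.
sample_csv = """mutant,DMS_score
A2C,0.91
D5E,-0.37
A2C:D5E,0.12
"""

def load_assay(handle):
    """Read (mutant, score) pairs from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return [(row["mutant"], float(row["DMS_score"])) for row in reader]

variants = load_assay(io.StringIO(sample_csv))

# Assumed convention: a colon joins the substitutions of a multi-mutant.
singles = [mutant for mutant, _ in variants if ":" not in mutant]
```

For a file on disk, the same function can be called as `load_assay(open(path, newline=""))`, where `path` points to the downloaded CSV.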