Purifying Selection

Also known as: negative selection. Related terms: stabilizing selection, evolutionary constraint

Evolutionary pressure that removes harmful mutations from a population over time, causing functionally critical positions to remain conserved across species.

Source: Kimura M. The Neutral Theory of Molecular Evolution. Cambridge University Press, 1983.

Purifying selection (also called negative selection) is the evolutionary process that removes mutations that reduce an organism’s fitness. Over millions of years, positions under strong purifying selection show little variation across species: the same amino acid is retained in nearly all homologs.
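One common way to quantify "little variation across species" is the Shannon entropy of each column in a multiple sequence alignment: a fully conserved column has entropy 0, and entropy grows as more residues share the column. The sketch below is illustrative only. The four-sequence alignment is a hypothetical toy example, and the function name column_entropy is an assumption, not something taken from the source.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # p * log2(1 / p) summed over observed residues; 0.0 when one residue fills the column.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical toy alignment: four homologs, five aligned positions, no gaps.
alignment = [
    "MCGAL",
    "MCGSV",
    "MCGTI",
    "MCGAF",
]

# zip(*alignment) walks the alignment column by column.
entropies = [column_entropy(col) for col in zip(*alignment)]

for position, entropy in enumerate(entropies, start=1):
    label = "conserved" if entropy == 0.0 else "variable"
    print(f"position {position}: {entropy:.2f} bits ({label})")
```

Low entropy is consistent with strong purifying selection but does not prove it. A real analysis would also need to handle alignment gaps and the uneven sampling of species, both of which this sketch ignores.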

Why It Matters for Protein Language Models

Protein language models like ESM-2 are trained on natural protein sequences, which are the product of evolution. From these sequences they learn which positions are conserved across species (under purifying selection) and which are variable (under diversifying selection or evolving neutrally). This is why ESM-2’s masked marginal scores correlate with functional importance:

  • High purifying selection → conserved position → ESM-2 assigns high evolutionary cost to mutations → strong correlation with experimental fitness
  • Low purifying selection → variable position → ESM-2 spreads probability across several amino acids and is uncertain → weak correlation with experimental fitness
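The two cases above can be sketched with the usual form of a masked marginal score: mask the position, read the model's predicted probabilities there, and take log p(mutant) minus log p(wild type). The probabilities below are hypothetical numbers chosen for illustration, not real ESM-2 outputs, and the function name masked_marginal_score is an assumption.

```python
import math


def masked_marginal_score(probs, wt, mut):
    """log p(mut) - log p(wt) at a position that was masked before prediction."""
    return math.log(probs[mut]) - math.log(probs[wt])


# Hypothetical predicted distributions at two masked positions (not real model output).
conserved_site = {"C": 0.97, "S": 0.01, "A": 0.02}  # one residue dominates
variable_site = {"L": 0.30, "V": 0.25, "I": 0.25, "F": 0.20}  # probability is spread out

conserved_score = masked_marginal_score(conserved_site, wt="C", mut="S")
variable_score = masked_marginal_score(variable_site, wt="L", mut="V")

print(f"conserved site C->S: {conserved_score:.2f}")
print(f"variable site  L->V: {variable_score:.2f}")
```

At the conserved site the score is strongly negative, because almost all probability sits on the wild-type residue. At the variable site every substitution scores close to zero, so the scores carry little information for ranking mutations against one another.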

The CYP2C9 Example

The CYP450 heme-binding domain has been under extreme purifying selection for ~2 billion years: the heme-thiolate coordination chemistry is identical in bacterial and human P450s. ESM-2 captures this deep constraint, and its scores reach a correlation of rho = 0.811 with experimental fitness at heme-binding positions.

Contrast this with the substrate recognition site SRS5: these positions have diverged across the CYP superfamily to accommodate different substrates in different organisms. ESM-2 reads this evolutionary diversity as tolerance to mutation and achieves only rho = 0.422 at SRS5.
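A per-region comparison of this kind can be computed by grouping variants by region and taking a rank correlation between model scores and measured fitness within each group. The sketch below assumes that rho is a Spearman rank correlation, which the text does not state explicitly. All values are hypothetical toy data and do not reproduce the CYP2C9 figures; the region names and helper functions are also assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical (model score, measured fitness) pairs for two regions.
regions = {
    "constrained_region": ([-5.0, -3.2, -1.1, -0.4], [0.05, 0.30, 0.70, 0.95]),
    "variable_region": ([-0.9, -0.2, -0.6, -0.4], [0.5, 0.8, 0.4, 0.9]),
}

rho_by_region = {name: spearman(s, f) for name, (s, f) in regions.items()}

for name, rho in rho_by_region.items():
    print(f"{name}: rho = {rho:.3f}")
```

With only four variants per region the numbers are purely illustrative. Real estimates need many variants per region, and a confidence interval would be useful before comparing regions.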

Evolutionary Conservation vs. Functional Importance

An important nuance: evolutionary conservation does not always equal functional importance for a specific protein. A position may be conserved because:

  1. Changing it is always damaging (true purifying selection)
  2. The ancestral function is shared across all homologs

Positions that are “important” in one specific protein context, such as residues that fine-tune substrate specificity, may not be conserved if other homologs use different amino acids for the same functional role. A conservation-based score can therefore underestimate them, as the SRS5 result above suggests.