Glossary
Definitions for AI, bioinformatics, and life sciences terms used across blog posts and research.
41 terms
AI / ML
- Boundary Condition (also: prediction boundary, model boundary): A documented context where ESM-2's prediction accuracy shifts from reliable to unreliable. Boundary conditions define where to trust the model and where to verify independently.
- Context Window (also: context length, context size): An AI model's short-term working memory: the maximum amount of text it can hold and process at once before older content is dropped.
- LLM (also: Large Language Model, language model): An AI model trained on vast amounts of text that can interpret and generate human language. Examples: GPT-5, Claude, Gemini.
- Masked Marginal Scoring (also: masked marginals, zero-shot scoring): ESM-2's zero-shot method for scoring variant effects. It masks each position in turn and measures how surprising the mutant amino acid is relative to the wild-type residue, typically as the difference in log-probability between the two.
- Model Context Protocol (also: MCP): An open standard that lets AI models connect to external tools and data sources (databases, APIs, file systems) through a common interface, without custom integration code for each one.
- Multi-Agent System (also: MAS, multi-agent architecture): A network of specialized, autonomous AI agents that collaborate and divide complex tasks among themselves to solve problems too difficult for a single model.
- RAG (also: Retrieval-Augmented Generation): A technique in which an AI retrieves relevant documents from a database before generating a response, grounding answers in real data to reduce hallucinations.
- Tokens (also: token, tokenization): The basic units of text an AI processes, each roughly a word or word fragment. A sentence of ~10 words is about 13–15 tokens.
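The Masked Marginal Scoring entry above can be illustrated with a minimal sketch. It assumes the per-position log-probabilities have already been computed by a model with each position masked in turn; here they are replaced by a toy table, and the names `masked_marginal_score` and `toy_log_probs` are hypothetical helpers, not part of any ESM-2 library.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def masked_marginal_score(masked_log_probs, wild_type, position, mutant):
    """Return log p(mutant) - log p(wild-type) at one masked position.

    masked_log_probs[i] is assumed to map each amino acid to the model's
    log-probability at position i, computed with position i masked.
    Negative scores mean the mutant is less expected than the wild-type.
    """
    row = masked_log_probs[position]
    return row[mutant] - row[wild_type[position]]


def toy_log_probs(sequence, wt_prob=0.6):
    """Toy stand-in for model output (an assumption, not real ESM-2 scores).

    The wild-type residue gets probability wt_prob at every position and
    the other 19 amino acids share the remainder equally.
    """
    other = (1.0 - wt_prob) / (len(AMINO_ACIDS) - 1)
    return [
        {aa: math.log(wt_prob if aa == wt else other) for aa in AMINO_ACIDS}
        for wt in sequence
    ]


sequence = "MKT"
log_probs = toy_log_probs(sequence)

# Score the substitution K2R (0-based position 1).
score = masked_marginal_score(log_probs, sequence, position=1, mutant="R")
print(f"K2R score: {score:.3f}")
```

With a real model, the toy table would be replaced by one forward pass per masked position; the scoring step itself stays the same.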
Bioinformatics
- Deep Mutational Scanning (also: DMS, deep mutational scan): A lab technique that measures the functional effect of every possible single amino acid substitution across a protein. Widely treated as the gold standard for experimental variant effect data.
- ESM-2 (also: ESM2, Evolutionary Scale Modeling 2): A protein language model by Meta AI trained on roughly 65 million unique protein sequences drawn from UniRef. Predicts how amino acid mutations affect protein function from sequence alone, with no structure required.
- MaveDB (also: Multiplexed Assay of Variant Effect Database): A public repository for deep mutational scanning and other multiplexed variant effect datasets. Each dataset gets a persistent URN for citation and reproducibility.
- Protein Language Model (also: PLM, pLM): A deep learning model trained on millions of protein sequences to predict how mutations affect function. NeuroAutomata uses ESM-2, a PLM developed by Meta AI.
- ProteinGym (also: ProteinGym benchmark, ProteinGym substitution benchmark): A standardized benchmark suite for protein variant effect predictors, covering 217 deep mutational scanning assays across diverse protein families.
- Purifying Selection (also: negative selection, stabilizing selection): Evolutionary pressure that removes harmful mutations from a population over time, causing functionally critical positions to remain conserved across species.
- Saturation Genome Editing (also: SGE): A CRISPR-based assay that introduces every possible single-nucleotide variant at a genomic locus in its native context, then measures functional impact through cell viability or another selection. Often treated as the gold standard for functional evidence in clinical variant classification of cancer genes.
- Saturation Mutagenesis (also: complete mutagenesis, exhaustive mutagenesis): Creating every possible single amino acid substitution at every position in a protein. It is the prerequisite for deep mutational scanning.
- VAMP-seq (also: Variant Abundance by Massively Parallel sequencing): A high-throughput assay that measures protein abundance (expression and stability) for thousands of variants simultaneously using fluorescent protein fusions and flow cytometry.
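To make the scale of saturation mutagenesis concrete, the following minimal sketch enumerates every single amino acid substitution for a short sequence. The function name `single_substitutions` and the three-residue example sequence are illustrative assumptions, not taken from any particular assay.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_substitutions(sequence):
    """Yield every single amino acid substitution as a label such as 'M1A'.

    Labels use the wild-type residue, the 1-based position, and the
    mutant residue. Synonymous "substitutions" (e.g. M1M) are skipped.
    """
    for index, wild_type in enumerate(sequence):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield f"{wild_type}{index + 1}{mutant}"


# Hypothetical three-residue example: 3 positions x 19 alternatives each.
variants = list(single_substitutions("MKT"))
print(len(variants), variants[:3])
```

A protein of length L therefore has 19 × L single substitutions, which is why these libraries grow into the thousands for even modest proteins.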
Proteomics
- Fold Enrichment (also: fold-enrichment ratio, tissue enrichment ratio): The ratio of a protein's expression in a target tissue to its highest expression in any other tissue. A value of 4x means the protein is 4 times more abundant in the target tissue than in any other tissue.
- Heme (also: heme group, heme cofactor): An iron-containing cofactor at the catalytic center of CYP450 enzymes. The iron coordinates to the protein via a conserved cysteine, forming a heme-thiolate bond that is universal across the CYP superfamily.
- Receptor-ligand interface (also: ligand-binding interface, receptor binding site): The residues on a receptor protein that make direct contact with its cognate ligand. On receptor-tyrosine-kinase ectodomains and other binding-driven receptors, these residues are conserved for ligand recognition rather than for substrate-pocket geometry.
- Tau Score (also: τ score, tissue specificity score): A tissue specificity metric (0–1000) used by the Human Protein Atlas. Higher values indicate expression restricted to fewer tissues. τ ≥ 100 is considered tissue-enriched.
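The Fold Enrichment definition above is simple arithmetic, sketched below under stated assumptions. The function name `fold_enrichment`, the tissue names, and the expression values are hypothetical, and the handling of zero expression in all other tissues is a choice made for this example only.

```python
import math


def fold_enrichment(expression_by_tissue, target_tissue):
    """Return target expression divided by the highest expression elsewhere.

    expression_by_tissue maps tissue name -> expression value, all in the
    same unit. Assumption for this sketch: if no other tissue expresses
    the protein, the ratio is reported as infinity (or 0.0 when the
    target is also unexpressed).
    """
    target = expression_by_tissue[target_tissue]
    others = [
        value
        for tissue, value in expression_by_tissue.items()
        if tissue != target_tissue
    ]
    if not others:
        raise ValueError("need at least one tissue besides the target")
    highest_other = max(others)
    if highest_other == 0:
        return math.inf if target > 0 else 0.0
    return target / highest_other


# Hypothetical values: liver is the target, kidney is the runner-up.
example = {"liver": 120.0, "kidney": 30.0, "brain": 5.0}
ratio = fold_enrichment(example, "liver")
print(f"{ratio:g}x")
```

Note that the denominator is the single highest non-target tissue, not the average of the others, so one strongly expressing tissue is enough to pull the ratio down.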
Genomics
- ACMG PP3/BP4 (also: PP3, BP4): ACMG criteria that allow computational variant effect predictions to count as supporting evidence for pathogenicity (PP3) or for a benign classification (BP4) in clinical variant classification, per the Richards et al. 2015 guidelines.
- ClinVar (also: ClinVar database): A public archive of human genetic variants and their reported clinical significance, hosted by NCBI. It is the primary reference for variant pathogenicity classifications in clinical genomic medicine.
- nTPM (also: normalized TPM, normalized Transcripts Per Million): An RNA expression unit used by the Human Protein Atlas. Reflects how many transcripts of a gene are present per million total transcripts in a sample.
- Pharmacogenomics (also: PGx, pharmacogenetics): The study of how genetic variants affect drug response, for example which patients metabolize drugs faster, slower, or differently because of inherited differences in drug-metabolizing enzymes.
- VUS (also: variant of uncertain significance, variant of unknown significance): A genetic variant found in a patient for which the available evidence is insufficient to classify it as pathogenic or benign.
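The "per million total transcripts" idea behind nTPM can be illustrated with plain TPM. This is a minimal sketch only: the gene names, read counts, and lengths are made up, the function name `tpm` is hypothetical, and the additional between-sample normalization that turns TPM into nTPM is not shown.

```python
def tpm(read_counts, lengths_kb):
    """Compute Transcripts Per Million from read counts and gene lengths.

    Step 1: divide each gene's read count by its length in kilobases,
            so long genes are not favored.
    Step 2: rescale the length-normalized rates so they sum to one million.
    """
    rates = {gene: read_counts[gene] / lengths_kb[gene] for gene in read_counts}
    total = sum(rates.values())
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}


# Hypothetical sample with three genes.
counts = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 200}
lengths = {"GENE_A": 1.0, "GENE_B": 3.0, "GENE_C": 1.0}

values = tpm(counts, lengths)
for gene, value in values.items():
    print(gene, round(value))
```

GENE_B has three times the reads of GENE_A but is also three times as long, so the two end up with the same TPM. That length correction is what makes values comparable across genes within a sample.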
Statistics
General