Tokens

Also known as: token, tokenization

The basic units of text an AI processes — roughly a word or word fragment. A sentence of ~10 words is about 13–15 tokens.

Source: Common AI/ML terminology

Tokens are the fundamental units that large language models use to process and generate text. Rather than working character-by-character or word-by-word, LLMs split text into subword pieces called tokens using algorithms like Byte Pair Encoding (BPE).
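
To make the subword idea concrete, here is a minimal sketch of BPE-style merging in Python. It is a toy illustration, not the tokenizer of any particular model: production tokenizers typically work on bytes, use large learned vocabularies, and handle whitespace and special tokens. The function names `learn_bpe_merges`, `merge_pair`, and `tokenize` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # The most frequent pair becomes the next merge rule.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(tokenize("lowly", merges))
```

Frequent character sequences end up as single tokens, while rarer words are split into several pieces. This is why token counts do not map one-to-one onto word counts.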

Rough Token Counts

Text                      Approximate Tokens
1 word (English)          ~1.3 tokens
1 sentence                ~15–20 tokens
1 paragraph               ~80–100 tokens
1 page (~500 words)       ~650 tokens
This glossary entry       ~300 tokens
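
When no tokenizer is at hand, the ~1.3 tokens-per-word figure can serve as a quick estimate. The sketch below assumes English text and whitespace-separated words; the helper name `estimate_tokens` is hypothetical, and real counts depend on the tokenizer and on the text itself (code, gene identifiers, and non-English text often tokenize less efficiently).

```python
TOKENS_PER_WORD = 1.3  # rough English average; an approximation, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its whitespace-separated word count."""
    word_count = len(text.split())
    return round(word_count * TOKENS_PER_WORD)


page = " ".join(["word"] * 500)  # stand-in for a ~500-word page
print(estimate_tokens(page))
```

For billing or context-limit decisions, count with the tokenizer that belongs to the model actually in use.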

Why Tokens Matter for Biological Research

LLM APIs are priced per token, with both input and output tokens counted. For a system that queries large biological databases, the numbers add up quickly:

  • A single Human Protein Atlas (HPA) gene record: ~200–500 tokens
  • 50 gene records passed to a model: ~10,000–25,000 tokens
  • Running 12 benchmark tests: ~$0.09–$0.19 per query at current GPT-5 pricing
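
The arithmetic behind these figures can be sketched as follows. The per-million-token prices are placeholders chosen for illustration, not actual GPT-5 or vendor pricing, and `estimate_cost_usd` is a hypothetical helper; substitute the rates from your provider's price list.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Cost of one request in USD, given prices per one million tokens."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Token range for 50 gene records at ~200-500 tokens each.
RECORDS = 50
low_tokens = RECORDS * 200
high_tokens = RECORDS * 500

# Placeholder prices (assumptions, not real pricing).
EXAMPLE_INPUT_PRICE = 1.00    # USD per million input tokens
EXAMPLE_OUTPUT_PRICE = 10.00  # USD per million output tokens

for tokens in (low_tokens, high_tokens):
    cost = estimate_cost_usd(tokens, 1_000, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE)
    print(f"{tokens} input tokens + 1000 output tokens: ${cost:.4f}")
```

Input and output tokens are usually billed at different rates, so the sketch keeps the two prices separate.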

Token efficiency — how much useful data you can pack into a context window — directly determines both the cost and the quality of AI-driven biological analysis.
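
One simple way to act on this is to pack as many records as fit under a fixed token budget. The sketch below is a greedy illustration under stated assumptions: `pack_records` is a hypothetical helper, the budget is arbitrary, and `approx_tokens` uses the rough 1.3 tokens-per-word estimate rather than a real tokenizer.

```python
def pack_records(records, token_budget, count_tokens):
    """Greedily select records, in order, whose total token count fits the budget."""
    selected, used = [], 0
    for record in records:
        cost = count_tokens(record)
        if used + cost > token_budget:
            continue  # this record does not fit; a later, smaller one still might
        selected.append(record)
        used += cost
    return selected, used


def approx_tokens(text):
    """Rough stand-in for a tokenizer: ~1.3 tokens per whitespace-separated word."""
    return round(len(text.split()) * 1.3)


records = [
    "gene A summary with tissue expression notes",
    "gene B summary",
    "gene C summary with a much longer free text description of localization data",
]
chosen, used = pack_records(records, token_budget=15, count_tokens=approx_tokens)
print(len(chosen), used)
```

Trimming each record to the fields the analysis needs before packing lets more records fit in the same budget.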