Tokens
Also known as: token, tokenization
The basic units of text an AI processes — roughly a word or word fragment. A sentence of ~10 words is about 13–15 tokens.
Source: Common AI/ML terminology
Tokens are the fundamental units that large language models use to process and generate text. Rather than working character-by-character or word-by-word, LLMs split text into subword pieces called tokens using algorithms like Byte Pair Encoding (BPE).
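To make the idea of subword merging concrete, here is a minimal toy sketch of how BPE learns merges from a tiny corpus. It is an illustration only, not the tokenizer of any particular model: the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) are hypothetical, and production tokenizers typically operate on bytes and use far larger vocabularies.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated toy corpus."""
    # Start with every word split into single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


if __name__ == "__main__":
    merges, words = learn_bpe("low low low lower lowest", num_merges=2)
    print(merges)
    print(words)
```

After two merges on this toy corpus, the frequent word "low" becomes a single symbol, while rarer words such as "lowest" remain split into several pieces. That is why common words tend to cost one token and unusual words cost several.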
Rough Token Counts
| Text | Approximate Tokens |
|---|---|
| 1 word (English) | ~1.3 tokens |
| 1 sentence (~12–15 words) | ~15–20 tokens |
| 1 paragraph | ~80–100 tokens |
| 1 page (~500 words) | ~650 tokens |
| This glossary entry | ~300 tokens |
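The figures in the table follow from the rough ratio of about 1.3 tokens per English word. A minimal sketch of that arithmetic is below; `estimate_tokens` is a hypothetical helper, and the ratio is only a rule of thumb. For exact counts you would need the tokenizer of the specific model you are using.

```python
TOKENS_PER_WORD = 1.3  # rough English average taken from the table; an approximation


def estimate_tokens(text, tokens_per_word=TOKENS_PER_WORD):
    """Estimate the token count of `text` from its whitespace-separated word count."""
    word_count = len(text.split())
    return round(word_count * tokens_per_word)


if __name__ == "__main__":
    sentence = "one two three four five six seven eight nine ten"
    print(estimate_tokens(sentence))            # 10 words
    print(estimate_tokens("word " * 500))       # a page of roughly 500 words
```

The estimate will drift for code, non-English text, or domain-specific vocabulary such as gene symbols, which often split into more tokens than ordinary words.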
Why Tokens Matter for Biological Research
LLM API usage is priced per token, counting both input and output. For a system querying large biological databases:
- A single HPA gene record: ~200–500 tokens
- 50 gene records passed to a model: ~10,000–25,000 tokens
- Running 12 benchmark tests: ~$0.09–$0.19 per query at current GPT-5 pricing
Token efficiency, meaning how much useful data you can pack into a context window, directly affects both the cost and the quality of AI-driven biological analysis.
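The record-count arithmetic above can be sketched as a small cost estimator. The prices used here are placeholder values chosen for illustration, not the actual pricing of GPT-5 or any other model, and the function names are hypothetical; substitute the per-million-token rates from your provider's current price list.

```python
def record_token_range(num_records, min_tokens_per_record, max_tokens_per_record):
    """Return the (low, high) total token estimate for a batch of records."""
    return (num_records * min_tokens_per_record,
            num_records * max_tokens_per_record)


def query_cost_usd(input_tokens, output_tokens,
                   input_price_per_million, output_price_per_million):
    """Estimate the cost of one query, with prices given in USD per million tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost


if __name__ == "__main__":
    # 50 gene records at 200-500 tokens each, as in the list above.
    low, high = record_token_range(50, 200, 500)
    print(low, high)

    # Placeholder prices (assumptions, NOT real pricing): 1.00 and 10.00 USD per million.
    # The 1,000-token output length is also an assumed value.
    for input_tokens in (low, high):
        cost = query_cost_usd(input_tokens, 1_000, 1.00, 10.00)
        print(f"{input_tokens} input tokens: ${cost:.4f}")
```

Because input tokens scale with the number of records passed in, trimming each record to the fields the model actually needs reduces cost roughly in proportion.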