The development of large language models is no longer measured only in computing power and billions of parameters, but also in tokens, the small fragments of text that form the basis of linguistic understanding. One trillion tokens correspond to roughly four terabytes of raw text, illustrating the enormous scale of the data today’s AI systems process.
From tokens to text
A token can represent a whole word, part of a word, or even a single character, depending on the language and the model. Roughly speaking, one token corresponds to about 0.75 words, or around four characters of text. A model trained on one trillion tokens has therefore processed approximately 750 billion words, or about four terabytes of uncompressed raw text (at roughly one byte per character).
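As a quick illustration, the arithmetic can be written out directly. The sketch below simply applies the approximate ratios quoted above (0.75 words and four characters per token, one byte per character); actual tokenizers and text encodings vary, so treat the result as an order-of-magnitude estimate.

```python
# Back-of-the-envelope conversion from tokens to words and raw text size.
# The ratios are the rough averages quoted above; real tokenizers vary by language.
TOKENS = 1_000_000_000_000   # one trillion tokens
WORDS_PER_TOKEN = 0.75       # ~0.75 words per token
CHARS_PER_TOKEN = 4          # ~4 characters per token
BYTES_PER_CHAR = 1           # assumes ~1 byte per character (mostly-ASCII text)

words = TOKENS * WORDS_PER_TOKEN                               # ~750 billion words
terabytes = TOKENS * CHARS_PER_TOKEN * BYTES_PER_CHAR / 1e12   # ~4 TB

print(f"{words:.3g} words, {terabytes:.1f} TB")                # 7.5e+11 words, 4.0 TB
```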
Modern AI models are therefore built on datasets that far exceed any traditional linguistic corpus, drawing on a substantial share of the publicly available web.
Known data points from research sources
OpenAI’s GPT-3 was trained on roughly 300 billion tokens, equivalent to about 1.2 terabytes of text, according to Nvidia. Meta’s Llama 2 raised the bar to two trillion tokens, or eight terabytes, while Llama 3 went even further – over 15 trillion tokens, corresponding to at least 60 terabytes of raw text.
Other open datasets such as Falcon RefinedWeb (5 trillion tokens, 20 TB) and The Pile (825 GB of English text) are widely used in research. Common Crawl, the open web archive that underpins many of these models, itself amounts to several hundred terabytes of raw data per snapshot.
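Applying the same rule of thumb of roughly four bytes of raw text per token to the published token counts gives a quick sanity check of the sizes reported above. This is an illustrative sketch, not the labs’ own accounting.

```python
# Rough raw-text size implied by published token counts, at ~4 bytes per token.
BYTES_PER_TOKEN = 4

datasets = {
    "GPT-3":             300e9,   # ~300 billion tokens
    "Llama 2":           2e12,    # ~2 trillion tokens
    "Llama 3":           15e12,   # over 15 trillion tokens
    "Falcon RefinedWeb": 5e12,    # ~5 trillion tokens
}

for name, tokens in datasets.items():
    tb = tokens * BYTES_PER_TOKEN / 1e12
    print(f"{name:18s} {tokens / 1e12:5.1f}T tokens  ~ {tb:5.1f} TB")
# GPT-3 ~ 1.2 TB, Llama 2 ~ 8 TB, Llama 3 ~ 60 TB, Falcon RefinedWeb ~ 20 TB
```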
Secret numbers but clear patterns
OpenAI has not disclosed the exact number of tokens used to train GPT-4 or GPT-5, but industry analysts estimate that GPT-4 was trained on around 13 trillion tokens – roughly 52 terabytes of text. Although unconfirmed, this gives an indication of the magnitude involved in the most advanced systems.
According to the so-called Chinchilla scaling laws, compute is used most efficiently when model size and training-token count are scaled together, with roughly 20 tokens for every parameter. This is why today’s AI developers are scaling not only their models but also their data to match.
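A minimal sketch of that rule of thumb is shown below. The 20-tokens-per-parameter ratio is the commonly quoted approximation of the Chinchilla result (Hoffmann et al., 2022), not a figure taken from this article’s sources, and the 70-billion-parameter example is hypothetical.

```python
# Chinchilla rule of thumb: compute-optimal training uses roughly 20 tokens
# per parameter (a commonly quoted approximation of Hoffmann et al., 2022).
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token budget for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

# Example: a hypothetical 70-billion-parameter model
print(f"{chinchilla_optimal_tokens(70e9) / 1e12:.1f} trillion tokens")  # ~1.4 trillion
```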
Why tokens matter
Tokens are the fundamental units that determine how much language, factual knowledge, and context a model can absorb. The more tokens, the broader the model’s grasp of human expression, though this also raises the demands on filtering, data quality, and energy consumption.
In practice, a model trained on trillions of tokens has been exposed to a large share of the text ever published in digital form, though not as a simple copy. Much of the material is filtered, deduplicated, and curated to optimise understanding rather than sheer volume.
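To give a flavour of the curation step mentioned above, the sketch below removes exact duplicates by hashing each document. Production pipelines use far more elaborate near-duplicate detection, so this is purely an illustration, not how any particular lab processes its data.

```python
import hashlib

def deduplicate(documents):
    """Drop exact duplicates by hashing normalised document text (illustrative only)."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["The same sentence.", "the same sentence.  ", "A different one."]
print(len(deduplicate(corpus)))  # 2
```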
Newshub Editorial in Europe – 16 October 2025