10/02/2025 - Articles
What is tokenization in AI? Explained simply
Artificial intelligence, and language models such as ChatGPT in particular, are fascinating because of their ability to understand and generate text almost like a human being. But how do these systems manage to process our language, which is full of ambiguities, dialects, neologisms, and special characters? The answer lies in an inconspicuous but crucial step: tokenization. It transforms texts into small units that machines can understand – tokens.
What does tokenization mean?
In the first step of text processing, the user input is tokenized. The AI breaks down the input text into smaller units called tokens.
A token can be:
a single character (e.g., "a" or "?"),
a word component (e.g., "pro-" or "-ject"),
or an entire word (e.g., "project").
How the text is divided depends on the tokenization method and, in particular, on the text corpus the tokenizer was trained on.
Why is this necessary? Language models work sequentially, generating a response by building up a string of tokens step by step. The architecture of the models requires a fixed token set: the neural networks on which language models are based have an input and an output layer, each containing a specific number of neurons, and each neuron in the output layer is assigned one element of the token set (a token).
The result of a "thinking step" by the neural network is a distribution of values across the output layer. The token belonging to the neuron with the highest value is appended to the output string. The process is repeated until the response text is complete.
Text generation
1. Based on user input and existing response text, the model generates a probability distribution across all tokens.
2. The token with the highest value is selected and appended to the output text.
3. Steps 1 and 2 are repeated until the response is complete.
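To make these three steps concrete, here is a minimal, self-contained Python sketch of the generation loop. The toy_model function is an invented stand-in for a real neural network, and always taking the highest-valued token (greedy decoding) is a simplification: production models usually sample from the distribution.

```python
# Minimal sketch of the generation loop described above (steps 1-3).
# "toy_model" is an invented stand-in for a real neural network: it
# returns a probability distribution over the whole vocabulary.
VOCAB = ["Hello", ",", " world", "!", "<end>"]

def toy_model(tokens):
    # Scripted toy distribution: strongly favors the next token of a
    # fixed sentence, so the example is deterministic.
    script = ["Hello", ",", " world", "!", "<end>"]
    target = script[min(len(tokens), len(script) - 1)]
    return [0.9 if tok == target else 0.1 / (len(VOCAB) - 1) for tok in VOCAB]

def generate(max_steps=20):
    tokens = []
    for _ in range(max_steps):
        probs = toy_model(tokens)              # step 1: distribution over all tokens
        best = VOCAB[probs.index(max(probs))]  # step 2: take the highest-valued token
        if best == "<end>":                    # step 3: stop when the response is complete
            break
        tokens.append(best)
    return "".join(tokens)

print(generate())  # -> Hello, world!
```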

Why is tokenization important for NLP and AI models?
Tokenization is more than just a technical preparatory step. It sets the basic conditions for working with language models.

What is NLP?
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that deals specifically with the processing and understanding of human language and written texts. Typical applications include chatbots, automatic translation, voice assistants such as Alexa or Siri, text analysis, and sentiment analysis.
Two practical consequences follow directly from this: the maximum text size that can be processed at once is measured in tokens, and the cost is calculated from the number of tokens in the question and the answer.
Maximum text length in tokens
The models work with a fixed character set. The text size that can be processed simultaneously is measured in tokens, not words or characters.
The so-called context window determines how many tokens a model can “see” at once. GPT-4o, for example, can handle up to 128,000 tokens – enough for entire books – and GPT-5 can handle up to 400,000 tokens. However, each additional token requires memory and computing time.
Costs based on tokens
The costs are also calculated based on the number of tokens processed. That is why it matters how efficiently a text is broken down. A text that requires only 20 tokens in English can quickly have one and a half times as many in German, making it more expensive.
Language differences
English requires the fewest tokens, the larger European languages slightly more, and some rare languages more than ten times as many. This makes English cheaper, faster, and more efficient – an important practical factor, as we will show later. Specifically, you can expect English texts to have about 10% more tokens than words, while German texts have about 20-30% more tokens than words. Multilingual tokenizers usually need more tokens than monolingual ones. (Source: https://jina.ai/de/news/a-deep-dive-into-tokenization/)
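As an illustration of such differences, token counts can be checked programmatically with OpenAI's tiktoken library (pip install tiktoken). This is a small sketch assuming the cl100k_base encoding (GPT-4 era); the example sentences are made up, and exact counts vary by encoding.

```python
# Count tokens for an English and a German sentence with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

sentences = {
    "EN": "Project management software keeps projects and processes under control.",
    "DE": "Projektmanagement-Software behält Projekte und Prozesse unter Kontrolle.",
}

for lang, text in sentences.items():
    n_words = len(text.split())
    n_tokens = len(enc.encode(text))
    print(f"{lang}: {n_words} words -> {n_tokens} tokens")
```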
How does tokenization work in practice?
Character tokenization
It would seem obvious to use individual letters, digits, and punctuation marks as tokens.
Example:
"Project management" → ["P", "r", "o", "j", "e", "c", "t", " ", "m", ...]
This makes the vocabulary very small. However, a single word is broken down into many tokens. The model would therefore have to recognize meanings across very long sequences. This would be inefficient and prone to errors.
Word tokenization
At the other end of the spectrum is splitting text into whole words:
Example:
"Project management is complex." → ["Project", "management", "is", "complex", "."]
Using only whole words as tokens is not an option either, as the vocabulary would become far too large. Even for German alone, millions of words would have to be taken into account, including inflections, dialects, and neologisms.
Subword tokenization – the standard approach
The solution used by modern models is a hybrid approach known as subword tokenization. The vocabulary starts with all single characters as tokens; then frequent character pairs are added, then groups of three and longer sequences, until the maximum vocabulary size of the model is reached. All words that do not correspond to a single token are composed of several tokens: frequently occurring words or word parts get their own tokens, while rarer words are broken down into smaller ones.
Example:
"Project management" → ["project", "manage", "ment"]
Byte-pair encoding
A widely used method for subword tokenization of large text corpora is Byte-Pair Encoding (BPE).
Procedure
- We begin with all characters as individual tokens.
- Next, we search for the most frequently occurring pairs of characters.
- We merge these pairs into new tokens.
- We repeat this process until we reach the desired vocabulary size.
Example (simplified):
Sentence: "With the project management software Projektron BCS, you have projects and business processes under control."
- Start: ["W", "i", "t", "h", " ", "t", "h", "e", " ", "p", "r", "o", "j", "e", "c", "t", ...]
- First step: find the most common pair. "ro" appears most often (6×) → replace it with ↑:
- "With the p↑ject management software P↑jekt↑n BCS, you have p↑jects and business p↑cesses under cont↑l."
- Next, "↑j" occurs most frequently → replace ↑j with ↔; "ct" is also merged into ↗:
- "With the p↔e↗ management software P↔ekt↑n BCS, you have p↔e↗s and business p↑cesses under cont↑l."
- We now have three new tokens, representing "ro", "ct", and "roj".
- " p" (space + "p") appears three times → replace it with ↘: "With the↘↔e↗ management software P↔ekt↑n BCS, you have↘↔e↗s and business↘↑cesses under cont↑l."
- Then ↘↔, which already corresponds to " proj", can be merged into ⇔: "With the⇔e↗ management software P↔ekt↑n BCS, you have⇔e↗s and business↘↑cesses under cont↑l."
In a similar fashion, useful units emerge, such as:
- “manage” (from “management”)
- “software”
- “BCS” (as its own token because it appears repeatedly)
- “business processes,” split into “business” and “processes”
Result
Instead of storing each character individually, we now have subwords:
["With", "the", "project", "manage", "ment", "-", "software", "Projektron", "BCS", "you", "have", "projects", "and", "business", "processes", "under", "control"]
This produces tokens that are useful for many kinds of text. The method allows large amounts of text to be represented efficiently using a relatively small number of tokens.
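For readers who want to see the merge loop in code, here is a simplified BPE training sketch in Python. It merges character pairs inside space-separated words; the toy corpus and the number of merges are invented for illustration, and real tokenizers (such as those used by GPT models) operate on bytes and treat spaces as part of tokens.

```python
# A minimal byte-pair-encoding training sketch.
from collections import Counter

def train_bpe(corpus, num_merges):
    # Represent each word as a sequence of single-character symbols.
    words = [list(word) for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Merge every occurrence of the best pair into one symbol.
        merged = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged.append(out)
        words = merged
    return merges, words

merges, words = train_bpe("project projects projektron", 5)
print(merges)  # [('r', 'o'), ('p', 'ro'), ('pro', 'j'), ('proj', 'e'), ('proje', 'c')]
```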
Language differences: Example of tokenization of a German sentence vs. an English sentence
To experiment with converting a text into tokens, there is a freely accessible page: the tokenizer from OpenAI. If you would like to view the complete token set of OpenAI's GPT-4, you can do so at https://gist.github.com/s-macke/ae83f6afb89794350f8d9a1ad8a09193.
As an example, I had the introduction to this article tokenized once in German and once in English and displayed the result once as text and once as token IDs.
We can see that the English version not only has fewer letters, but is also broken down into 14 fewer tokens than the German version.
What are the challenges of tokenization?
As sophisticated as tokenization is, it repeatedly runs into limitations in practice. Interestingly, even if a model does not use an optimal tokenization strategy, it can learn to make the right decisions from imperfect input, provided the network is large enough and there is sufficient data and training. As a result, less effort goes into improving tokenization than into other areas (source: https://jina.ai/de/news/a-deep-dive-into-tokenization/); the focus of development currently lies elsewhere.
The main challenges are:
1. Language-specific differences
Language models are primarily trained on English text. Tokenization is therefore particularly efficient in this language, as texts require comparatively few tokens. In other languages, such as German with its long compound words, the number of tokens increases significantly. This makes queries not only more expensive but also slower.
2. Special characters and emojis
Characters outside normal language usage can also cause problems. Emojis, symbols, or unusual character combinations are often broken down into many individual tokens. A simple "✅⚡" can take up to six tokens, depending on the model.
3. Technical terminology and abbreviations
Technical terms or abbreviations that rarely appear in the training material are particularly difficult. An expression such as "PMO board meeting" is often broken down into several tokens, which makes recognition and interpretation harder. For users, this means that models do not always reliably understand technical language.
4. Limited context window
Finally, the length of the context window is a key limitation. Long texts must be divided into sections once the token limit is reached. How well this division works depends directly on tokenization and determines whether context is preserved or lost.
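To illustrate challenges 2 and 4, the following sketch uses the tiktoken library to count the tokens of an emoji pair and to split a longer text into chunks with a fixed token budget. The naive cut at arbitrary token boundaries is exactly what good splitting strategies try to improve on, for example by preferring sentence or paragraph borders.

```python
# Token-based chunking and emoji token counts with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Challenge 2: even two emojis cost several tokens.
print(len(enc.encode("✅⚡")), "tokens for '✅⚡'")

# Challenge 4: split a long text into chunks of at most max_tokens tokens.
def chunk_by_tokens(text, max_tokens):
    ids = enc.encode(text)
    return [enc.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

long_text = "Tokenization determines how long texts are split into sections. " * 50
chunks = chunk_by_tokens(long_text, 128)
print(len(chunks), "chunks,", [len(enc.encode(c)) for c in chunks[:3]], "tokens in the first three")
```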
What experience does Projektron have with tokenization?
At first glance, this topic sounds very technical. In our experience, however, it has practical implications for everyday work. It is advantageous to write system prompts in English: they are easier for the models to understand, take up less space in the context window, perform better, and are less expensive.
FAQ: Frequently asked questions about tokenization
How much text fits into an AI model?
That depends on the so-called context window – the maximum number of tokens that the model can "see" at one time. With GPT-4o, that's up to 128,000 tokens, roughly equivalent to 300 pages of text. Newer models such as GPT-5 go even further, managing 400,000 tokens – enough for entire reference books or extensive project documentation. Note, however, that the closer you get to this limit, the higher the computing effort and therefore the cost.
How many words correspond to one token?
There is no direct conversion, because tokens can be parts of words, whole words, or even single characters. As a rule of thumb, however, one token corresponds to about 0.75 words in English texts, but only about 0.5 words in German. This is due to German's longer compound words. The same sentence can therefore require a different number of tokens depending on the language, which in turn affects the cost and speed of processing.
Can I calculate the number of tokens in advance?
Yes, there are practical tools for this. Online tools such as tiktokenizer.io or the tokenizer on the OpenAI platform can reliably determine the number of tokens in a text. Libraries such as tiktoken (Python) can also be integrated locally. For companies and developers, it makes sense to check texts before sending them to a language model: this makes costs predictable and avoids exceeding the context limit.
Why is AI cheaper in English?
The main reason is token efficiency. Since most models are based primarily on English training data, English texts are tokenized much more compactly. A sentence in English often requires only 80 percent of the tokens that a German sentence would require. This means that queries in English take up less space in the context window, run faster, and cost less money. That's why many developers – including us at Projektron – recommend writing system prompts in English whenever possible, even if the output can later be displayed in German.
Conclusion: Small units, big impact
Tokenization is an inconspicuous but central component in the functioning of modern language models. It determines how efficiently and accurately texts are processed and thus influences the cost, speed, and comprehensibility of AI systems.
For companies that use AI in a targeted manner, it is helpful to understand the mechanisms behind it. At Projektron, a better understanding of tokenization enabled us to make our AI help in BCS more powerful, affordable, and user-friendly.
To learn how we developed the Projektron AI help and what role tokenization plays in the background, read the article “Projektron and AI: Experiences in developing and optimizing BCS help.”

About the author
Dr. Marten Huisinga heads teknow GmbH, a platform for laser sheet metal cutting, where AI methods are set to simplify the offering for hobbyist customers in the future. Huisinga was one of the three founders and, until 2015, co-managing director of Projektron GmbH, for which he now works as a consultant. As DPO, he is responsible for implementing the first AI applications in order to assess the benefits of AI for BCS and Projektron GmbH.
More interesting articles on the Projektron blog

Product management at Projektron
How does software remain successful for 25 years? Projektron BCS shows that continuous updates, user feedback, and modern technologies ensure long-term success. Learn how product management works at Projektron.

Use cases for AI in BCS
Step by step, an AI ecosystem is emerging at BCS that is making everyday work noticeably easier. The article shows which use cases are already productive and which functions are still to come.

AI help in BCS
Since version 25.3, the new BCS AI user help has been providing precise answers to questions about Projektron documentation. The article shows how iterative optimizations in retrieval and splitting have significantly improved the quality of responses.

Choosing PM software
If your SME or company is about to choose project management software, you probably don't know where to start looking for the right PM tool. This guide takes you through the PM software market and leads you to the right decision in 9 steps.
