Tokens are the smallest units large language models use to process text. They determine input length, inference cost, and context capacity. This article uses a hands-on Python example to show how a tokenizer splits text, maps tokens to IDs, and counts them, helping developers understand where billing comes from and how to optimize it. Keywords: Token, Tokenizer, Transformers.
Technical specifications at a glance
| Parameter | Description |
|---|---|
| Topic | LLM token fundamentals and hands-on practice |
| Language | Python |
| Core protocol/ecosystem | Hugging Face Transformers, local tokenizer |
| Core dependencies | transformers, torch |
| Applicable scenarios | Prompt optimization, cost control, context analysis |
Tokens are the smallest processing units that LLMs use to understand text
A token is neither a character count nor a word count. It is a fragment of text produced by a model's tokenizer. In English, a token is often a full word, a word stem, or a punctuation mark. In Chinese, it is often one or two characters, but that is not a strict rule.
This means the same sentence can produce different token counts in different models. The reason is not the size of the model parameters, but the fact that each model may use a different tokenizer and vocabulary. As a result, segmentation results, length statistics, and final cost can all vary.

AI Visual Insight: This image provides an intuitive view of the token concept. The key idea is that natural language does not go directly into the model. Instead, it is first split into multiple discrete text fragments. The figure highlights that tokens sit between “raw text” and “model computation” and serve as the foundational unit for context length, billing, and semantic encoding.
Tokens are not fixed-length character slices
Many beginners assume that one token always equals one character or one word. In reality, token boundaries are determined by the tokenization algorithm. Common algorithms prioritize frequent text fragments and then split more complex words into smaller subword units to balance vocabulary size and generalization ability.
text_examples = ["hello world", "你好,我是cool。", "Tokenization"]
# This is only a demonstration: the same text usually produces different token counts under different tokenizers
for text in text_examples:
    print(text)  # Print the text to analyze
This code illustrates a simple point: token length depends on the tokenizer, not on the number of visible characters.
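To make this concrete with an actual tokenizer, here is a minimal sketch that compares character counts with token counts. It assumes a local tokenizer directory named "Qwen2_tokenizer", following the main example later in this article; any tokenizer you have available will do, and different tokenizers will give different numbers.
from transformers import AutoTokenizer

# Assumption: a local tokenizer directory named "Qwen2_tokenizer", as in the main example below
tokenizer = AutoTokenizer.from_pretrained("Qwen2_tokenizer")

for text in ["hello world", "你好,我是cool。", "Tokenization"]:
    tokens = tokenizer.tokenize(text)
    # Character count and token count rarely line up one to one
    print(f"{text!r}: {len(text)} characters -> {len(tokens)} tokens")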
Token optimization directly affects model cost and response efficiency
The original article emphasizes token optimization, and this is critical in real-world development. In most cases, the cost and latency of a single LLM call strongly correlate with the total number of input and output tokens.

AI Visual Insight: This image shows the token optimization pipeline: starting from the original prompt, you compress descriptions, reduce redundant context, and constrain output formats to lower total token usage. The technical focus is on reducing ineffective context, not simply shortening the number of characters.
The total tokens for one model call consist of input and output
The total usage for a single request can be expressed as:
Total Tokens = Input Tokens + Output Tokens
The input side usually includes four cost categories: the user question, the system prompt, conversation history, and the formatting overhead introduced by message packaging.
That is why a question that “looks short” can still become expensive in a multi-turn conversation. The model does not receive only the current sentence. It receives the full conversation state.

AI Visual Insight: This image breaks down the token composition of an LLM call. It shows that input tokens are divided into four layers: user query, system instructions, historical messages, and protocol overhead. It reveals the real reason costs grow in multi-turn conversations: accumulated context, not just a single long question.
# Estimate the total token cost of one request
input_tokens = 120 # User input + system prompt + history + formatting overhead
output_tokens = 80 # Model-generated content
total_tokens = input_tokens + output_tokens # Total token usage
print("Total usage:", total_tokens)
This code expresses the minimum calculation model behind token-based billing.
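To connect that formula to the multi-turn cost problem described above, the sketch below estimates the input side for a whole conversation. The per-message overhead constant is an illustrative assumption, not a figure published by any provider, and the tokenizer directory name again follows the main example below.
from transformers import AutoTokenizer

# Assumption: the same local Qwen2 tokenizer used in the main example below
tokenizer = AutoTokenizer.from_pretrained("Qwen2_tokenizer")

# Illustrative assumption: a rough per-message allowance for role tags and message packaging
PER_MESSAGE_OVERHEAD = 4

def estimate_input_tokens(system_prompt, history, question):
    # The model receives the system prompt, every past message, and the current question
    messages = [system_prompt] + history + [question]
    return sum(len(tokenizer.tokenize(m)) + PER_MESSAGE_OVERHEAD for m in messages)

history = [
    "What is a token?",
    "A token is a text fragment produced by the tokenizer.",
]
print("Estimated input tokens:", estimate_input_tokens("You are a concise assistant.", history, "How are tokens billed?"))
Every additional turn adds its own tokens plus packaging overhead, which is why costs keep growing even when each new question is short.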
Different models can produce different tokenization results for the same text
The same text does not map to a universal token sequence across models. The root cause is that each model vendor may maintain its own vocabulary, merge rules, and encoding strategy. As a result, token counts are only meaningful for comparison within the same model family.

AI Visual Insight: This image highlights a key technical fact: the same text can produce different token results across different models. It typically uses side-by-side examples to show tokenizer differences, demonstrating why token statistics cannot be mechanically reused across models without risking incorrect estimates of context limits and API costs.
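If you want to verify this yourself, a minimal sketch is to run the same text through two tokenizers and compare the counts. The names below are placeholders; point them at tokenizer directories or model names you actually have locally.
from transformers import AutoTokenizer

text = "你好,我是cool。"

# Placeholder names: replace with two tokenizers you actually have available locally
for name in ["Qwen2_tokenizer", "another_local_tokenizer"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokens = tokenizer.tokenize(text)
    print(f"{name}: {len(tokens)} tokens -> {tokens}")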
You can use Transformers to directly inspect how a tokenizer works
If you want to truly understand tokens, the most effective approach is not to memorize definitions but to run tokenization yourself. The original article uses transformers and torch as dependencies and relies on a local Qwen2 tokenizer to perform segmentation, ID conversion, counting, and decoding.
pip install transformers torch
This command installs the core dependencies required for tokenization and inference experiments.
The Python example below walks through the full token generation process
from transformers import AutoTokenizer
# Load a local tokenizer; replace the name with your actual directory or model name
tokenizer = AutoTokenizer.from_pretrained("Qwen2_tokenizer")
# Text to process, mixing Chinese, punctuation, and English to make segmentation differences easier to observe
text = "你好,我是cool。"
# Step 1: Run tokenization to get a list of tokens in string form
bpe_tokens = tokenizer.tokenize(text)
print("Original tokens:", bpe_tokens)
# Step 2: Convert tokens into integer IDs from the vocabulary
# The model accepts numbers, not raw natural language text
token_ids = tokenizer.convert_tokens_to_ids(bpe_tokens)
print("Token IDs:", token_ids)
# Step 3: Decode each token one by one to inspect the readable fragment each token represents
decoded_parts = []
for token in bpe_tokens:
    token_id = tokenizer.convert_tokens_to_ids(token)  # Convert a single token to an ID
    piece = tokenizer.decode([token_id])  # Decode a single ID back to a text fragment
    decoded_parts.append(piece)
print("Segmentation result:", decoded_parts)
# Step 4: Count the number of tokens for length and cost analysis
count = len(token_ids)
print("Total tokens:", count)
# Step 5: Decode the full ID sequence back into text to verify reversibility
print("Decoded result:", tokenizer.decode(token_ids))
This code covers the four key actions of a tokenizer: segmentation, indexing, counting, and decoding.
This code helps you see what actually happens before text reaches the model
tokenize() splits text into subword units, convert_tokens_to_ids() maps text symbols into integers, and decode() performs the reverse check to verify whether those IDs can be reconstructed into readable text. This workflow is the standard preprocessing pipeline that natural language goes through before entering the model.
This experiment is also useful in prompt engineering. You can use it to compare the token cost of different phrasings and find prompt templates that remain clear while consuming fewer tokens.
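As a minimal sketch of that comparison, the snippet below counts tokens for two made-up phrasings of the same instruction, assuming the same local Qwen2 tokenizer as in the main example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen2_tokenizer")  # Same assumption as the main example

# Hypothetical phrasings of the same instruction
candidates = [
    "Please carefully read the following text and then provide a detailed summary of its main points.",
    "Summarize the following text.",
]
for prompt in candidates:
    print(len(tokenizer.tokenize(prompt)), "tokens:", prompt)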
Developers should optimize context first instead of blindly compressing output
In real projects, the most common waste usually does not come from the answer. It comes from the input. Lengthy system prompts, repeated history messages, and unstructured multi-turn context often consume far more tokens than the output itself.
A more effective strategy is to compress system prompts, summarize prior conversation, constrain response formats, and trim retrieved content on demand. Focus your optimization on high-information-density input instead of simply chasing shorter text.
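One way to put "trim context on demand" into practice is a token-budgeted truncation of conversation history. The sketch below is an assumption-level illustration that reuses the local Qwen2 tokenizer from the main example; real systems often combine this with summarization instead of dropping messages outright.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen2_tokenizer")  # Same assumption as the main example

def trim_history(messages, budget):
    # Keep the most recent messages whose combined token count fits within the budget
    kept, used = [], 0
    for message in reversed(messages):
        cost = len(tokenizer.tokenize(message))
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept)), used

history = ["First turn ...", "Second turn ...", "Third turn ..."]
trimmed, used = trim_history(history, budget=50)
print(f"Kept {len(trimmed)} of {len(history)} messages, {used} tokens")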
FAQ
Why does the same sentence have a different token count in different models?
Because different models use different tokenizers, vocabularies, and segmentation rules. Even when the meaning is identical, the resulting token sequence and length can differ significantly.
What is the relationship between tokens, character counts, and word counts?
There is no fixed one-to-one mapping. A token is an internal segmentation unit used by the model. It may represent one character, part of a word, a word stem, or even punctuation.
How can I quickly reduce token cost in development?
Start by removing redundant system instructions, compressing conversation history, limiting output length, and summarizing retrieval results. These methods are usually more effective than simply shortening the user’s question.
Core summary: This article explains token definitions, segmentation rules, billing mechanics, and optimization strategies in clear language. It also demonstrates the full workflow of tokenization, ID conversion, counting, and decoding with Python and Transformers, helping developers quickly understand the cost structure behind LLM inputs and outputs.