Let's build the GPT Tokenizer

Andrej Karpathy
Science & Technology · 134 min video
Feb 20, 2024

TL;DR

Tokenizer deep dive: from UTF-8 bytes to BPE vocab, GPT quirks, and tips.

Key Insights

1. Tokenization is the hidden engine behind most LLM quirks; small design choices can ripple through performance and behavior.

2. Byte-pair encoding (BPE) compresses text by iteratively merging the most frequent adjacent token pair, growing the vocabulary with each merge.

3. UTF-8 bytes provide a stable basis for tokenization, but naive byte-level schemes create prohibitively long sequences; BPE densifies the input.

4. Special tokens (end-of-text, start-of-sequence, and model-specific tokens) shape training and inference, often requiring embedding and architecture adjustments.

5. Tokenizer design varies across models (GPT-2 vs. GPT-4 vs. Llama) and tools (tiktoken vs. SentencePiece), each with trade-offs in efficiency, case handling, and unknown tokens.

6. Tokenization quality influences multilingual support, code handling (e.g., Python indentation), arithmetic tasks, and overall context-length utilization.

INTRODUCTION TO TOKENIZATION

The video opens with tokenization framed as a necessary but hairy part of modern language models. Tokenization converts strings into sequences of tokens that the model can process. Early work used naive character-level tokenization on a Shakespeare dataset with a fixed 65-character vocabulary, showing how tokens map to embeddings and feed into Transformer stacks. The speaker emphasizes how small tokenization quirks—like where spaces or punctuation land in token sequences—can create a cascade of behavioral idiosyncrasies in LLMs, underscoring why tokenization deserves careful study.

FROM BYTES TO TOKENS: UTF-8, UNICODE, AND THE NEED FOR A VOCABULARY

The talk moves from Unicode code points to practical encoding. Python's ord() function reveals code points (e.g., 104 for 'h'), but the Unicode space is vast (~150,000 characters), making direct code-point indexing impractical. UTF-8 offers a compact, variable-length encoding (1–4 bytes per code point) and backward compatibility with ASCII, making it a preferred base for text pipelines. However, raw UTF-8 bytes produce long sequences; thus we need a smarter tokenization approach that preserves context while keeping the vocabulary manageable and efficient for Transformers.
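
The code point vs. byte distinction above can be seen directly in the standard library (a minimal illustration, not code from the lecture):

```python
# Unicode code points vs. UTF-8 bytes.
text = "hello 안녕"

# ord() exposes the Unicode code point of a single character.
print(ord("h"))  # 104

# UTF-8 is variable-length: ASCII stays at 1 byte per character,
# while each Hangul syllable here takes 3 bytes.
raw = text.encode("utf-8")
print(len(text))  # 8 characters
print(len(raw))   # 12 bytes: 6 ASCII + 2 * 3-byte Hangul
```

This is exactly why raw byte-level tokenization inflates sequence lengths, especially for non-ASCII text.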

BPE AS THE CORE OF MODERN TOKENIZATION

Byte Pair Encoding (BPE) is introduced as the primary mechanism to compress the UTF-8 byte stream into a useful token vocabulary. The algorithm iteratively finds the most frequent adjacent pair of tokens and replaces it with a new token, expanding the vocabulary (e.g., from 256 raw bytes to additional merge tokens). This creates a compact representation where common patterns (like 'e ' or common word parts) become single tokens. The result is a shorter token sequence with richer merges, enabling more text to fit into a fixed context window.
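
The core merge step described above fits in a few lines of plain Python (a toy sketch, not the production tokenizer; function names are illustrative):

```python
from collections import Counter

def get_stats(ids):
    """Count how often each adjacent pair of tokens occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # start from raw bytes
top = get_stats(ids).most_common(1)[0][0]  # most frequent adjacent pair
ids = merge(ids, top, 256)                 # first new token beyond the 256 bytes
```

Here the most frequent pair is two 'a' bytes (97, 97), and merging it shortens the sequence from 11 tokens to 9 while growing the vocabulary by one.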

TRAINING A TOKENIZER: MERGES, VOCAB, AND HYPERPARAMETERS

Training a tokenizer involves choosing a final vocabulary size and performing a sequence of merges to reach that size. The example starts with 256 initial raw-byte tokens and performs a number of merges (e.g., 20) to create larger tokens, building a forest-like structure of merges rather than a single tree. This process yields compression (e.g., from 24,000 bytes to ~19,000 tokens in a sample) and a density that depends on the chosen vocabulary target. Importantly, the tokenizer is a separate preprocessing stage with its own training data and objectives, impacting multilingual behavior and code handling.
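
The training loop sketched above is just the merge step repeated until the target vocabulary size is reached (a self-contained toy version; names and sizes are illustrative):

```python
from collections import Counter

def get_stats(ids):
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id); i += 2
        else:
            out.append(ids[i]); i += 1
    return out

def train(text, vocab_size):
    """Toy BPE trainer: perform (vocab_size - 256) merges over raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (pair) -> new token id, recorded in insertion order
    for new_id in range(256, vocab_size):
        stats = get_stats(ids)
        if not stats:
            break
        pair = stats.most_common(1)[0][0]
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

ids, merges = train("aaabdaaabac" * 10, 260)  # vocab 260 = 256 bytes + 4 merges
```

Each learned merge can itself contain earlier merge tokens, which is why the result is a forest of merges rather than a flat list of substrings.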

ENCODING AND DECODING: HANDLING BPE MERGES AND UTF-8 ERRORS

Encoding selects tokens from the learned merges and converts text to a token sequence, while decoding reassembles bytes into text. The implementation must carefully map IDs back to bytes and then decode with UTF-8, using strict or forgiving error handling. Real-world issues arise when some token sequences don’t decode cleanly (invalid UTF-8), necessitating errors='replace' to avoid crashes. The decoding path must also respect the insertion order of merges, and must consider the possibility of rare or unknown tokens, which can propagate into the model as untrained embedding vectors.
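
Both directions can be sketched as follows, assuming a `vocab` dict mapping id to bytes and an insertion-ordered `merges` dict (a simplified version; the lecture's actual encode repeatedly picks the earliest-learned applicable pair):

```python
def decode(ids, vocab):
    """Map token ids back to bytes, then UTF-8-decode forgivingly."""
    raw = b"".join(vocab[i] for i in ids)
    # errors="replace" emits U+FFFD instead of raising on invalid UTF-8.
    return raw.decode("utf-8", errors="replace")

def encode(text, merges):
    """Apply merges in the order they were learned (dicts preserve order)."""
    ids = list(text.encode("utf-8"))
    for (a, b), new_id in merges.items():
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == (a, b):
                out.append(new_id); i += 2
            else:
                out.append(ids[i]); i += 1
        ids = out
    return ids

# vocab: the 256 raw bytes plus any learned merges (none in this sketch).
vocab = {i: bytes([i]) for i in range(256)}
print(decode([128], vocab))  # lone continuation byte -> '\ufffd', not a crash
```

Byte 128 alone is not valid UTF-8, which is exactly the case that forces the forgiving error handler.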

GPT-2, GPT-4, AND LLAMA: TOKENIZATION AT SCALE

The speaker surveys how tokenization scales across models. GPT-2 used a 50,257-token vocabulary with a 1,024-token context, while GPT-4's tokenizer expands the vocabulary to roughly 100,000 tokens and tokenizes Python far more densely. Token-density improvements (e.g., grouping multiple spaces into a single token) help preserve code structure and extend usable context. Llama shows how non-English text tends to inflate token counts, because the tokenizer splits it into smaller pieces, which strains attention budgets and sequence length in practice.

SPECIAL TOKENS AND REGEX PATTERNS

Special tokens such as end-of-text or start-of-sequence are discussed as essential scaffolding for training and downstream tasks like chat and document delimitation. The GPT-2 tokenizer uses a single end-of-text token, while GPT-4 and friends add more tokens (start, end, prefix, suffix, etc.) to encode conversations and tool usage. In practice, tokenizers may enforce rules that avoid certain merges (via regex or custom logic), preserving semantic boundaries (e.g., not merging letters with adjacent punctuation) and ensuring predictable decode-ability and model behavior.
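The kind of enforced split described above can be sketched with a simplified, ASCII-only pattern using the standard library (the real GPT-2 pattern uses the third-party `regex` module with Unicode classes like \p{L}; this is only an approximation of the idea):

```python
import re

# Simplified stand-in for GPT-2-style pre-tokenization: text is split into
# chunks before BPE runs, so merges can never cross a chunk boundary
# (e.g., a letter never fuses with adjacent punctuation).
PAT = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

chunks = PAT.findall("Hello world!!!")
print(chunks)  # ['Hello', ' world', '!!!']
```

BPE then runs independently inside each chunk, which keeps token boundaries semantically predictable.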

TOOLING: TIKTOKEN AND SENTENCEPIECE

The talk compares tokenization tools. tiktoken is OpenAI's inference-focused tokenization library (fast, with a Rust core, and aligned with the GPT-2 and GPT-4 tokenizers), while SentencePiece is a training-plus-inference tool used by Llama and other models, offering several training algorithms and a robust, though sometimes convoluted, configuration space. tiktoken highlights the preprocessing and special-token handling, whereas SentencePiece emphasizes a broader training regime (including byte fallback in some modes). The speaker notes caveats about defaults, debugging complexity, and the trade-offs between these ecosystems.

PRACTICAL CHALLENGES: NON-ENGLISH LANGUAGES, CODE, AND CONTEXT LENGTH

Tokenization directly shapes multilingual performance and code handling. English tends to yield longer, denser tokens; non-English languages often incur more tokens for the same content, bloating context requirements. Python code illustrates a critical inefficiency: in older tokenizers, each indentation space frequently became its own token, wasting context length. The talk emphasizes how tokenizer choices influence arithmetic tasks, spelling, and even logic tasks, and shows how GPT-4's tokenizer addressed the Python problem by grouping runs of whitespace into single tokens.
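
The whitespace effect can be made concrete with a toy count (an illustration of the idea, not GPT-4's actual merge rules):

```python
import re

code = "def f():\n        return 1\n"  # 8-space indent

# Naive byte-level view: every one of the 8 indent spaces is its own token.
naive = len(code.encode("utf-8"))

# GPT-4-style improvement (sketched): collapse each run of spaces into a
# single token, leaving every other byte as its own token.
grouped = len(re.findall(r" +|[^ ]", code))

print(naive, grouped)  # 26 vs. 19 tokens for this tiny snippet
```

On deeply indented real code, the savings compound quickly, which is why this change matters for context-length utilization.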

DESIGN SPACE: PROMPT COMPRESSION, SPECIAL TOKENS, AND MODEL SURGERY

Beyond basic tokenization, models explore design-space options like prompt compression (gist tokens) and adding custom tokens that require minimal model surgery (embedding and output layer resizing) to support new functions. The speaker highlights strategies to inject new behavior without re-training the whole model, enabling efficient fine-tuning. This section also touches on multimodal tokenization approaches and the possibility of tokenizing images or other modalities into the same Transformer framework, leveraging existing tokenization pipelines.
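
The "minimal model surgery" mentioned above amounts to appending one freshly initialized row to the embedding table (and, symmetrically, to the output projection). A pure-Python stand-in with illustrative sizes, in place of what a framework like PyTorch would do with real tensors:

```python
import random

random.seed(0)
d_model, vocab_size = 4, 8  # toy dimensions, for illustration only

# Trained token-embedding table: one row per token id.
embedding = [[random.gauss(0, 0.02) for _ in range(d_model)]
             for _ in range(vocab_size)]

# Add one new special token: its id is the next free slot, and its row is
# freshly initialized while every trained row is left untouched.
new_token_id = vocab_size
embedding.append([random.gauss(0, 0.02) for _ in range(d_model)])
vocab_size += 1
```

Only the new row (and its output-layer mirror) needs training, which is what makes fine-tuning a handful of custom tokens cheap relative to retraining the model.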

FUTURE DIRECTIONS: TOKENIZATION-FREE IDEAS AND PRACTICAL ADVICE

The talk surveys ambitious ideas like tokenization-free autoregressive modeling, hierarchical Transformers, and fully end-to-end raw byte streams. While tantalizing, these ideas remain largely experimental. In the meantime, pragmatic guidance favors reusing the established GPT-4 vocabulary with a high-quality tokenizer (tiktoken) for inference, and carefully choosing trainer/tokenizer combinations for new experiments. The speaker closes by acknowledging the complexity and practical necessity of tokenization, while outlining a path toward more robust or even tokenization-free architectures in the distant future.

Common Questions

What does tokenization actually do, and why does it matter?

Tokenization converts text (Unicode code points) into discrete token IDs that an LLM can embed and process. Many model behaviors (spelling, non-English performance, code handling) trace back to how text is chunked into tokens.


Mentioned in this video

Study: GPT-2 paper

Referenced as the paper that introduced byte-level tokenization for large language models and motivated many tokenizer design decisions (vocab size, properties).

Study: LLaMA 2 paper

Mentioned to show how tokens are pervasive in model descriptions and to motivate tokenizer training/coverage decisions (e.g., tokens trained on large corpora).

Tool: cl100k_base (GPT-4 tokenizer)

The tokenizer used for GPT-4 (a ~100k token vocabulary). The video compares it to GPT-2 tokenizer and shows improvements in whitespace/coding efficiency.

Tool: tiktoken library

Official OpenAI tokenization library (in Rust) for inference with GPT tokenizers; discussed for usage and special-token handling.

Tool: minbpe (repo / implementation)

Author's repository and exercises for training a GPT‑4-like tokenizer from scratch (includes training code, visualization of merges, and tests).

Study: Integer Tokenization is Insane (blog post)

A blog post investigating how numbers are tokenized (variable tokenization for integers) and its impact on arithmetic performance.

Tool: fill-in-the-middle (FIM) special tokens

Special tokens (prefix/middle/suffix) used to encode fill-in-the-middle tasks; briefly mentioned as part of GPT-4 tokenizer special tokens.

Tool: tiktokenizer (web app)

A live web demo used in the video to visualize tokenization (GPT-2 vs. cl100k_base/GPT-4 tokenizers).

Tool: OpenAI encoder.py (GPT-2 encoder implementation)

The inference implementation released by OpenAI that applies saved merges and vocabulary — the lecture walks through how it performs BPE at inference time.

Tool: SentencePiece

Google's library for both tokenizer training and inference (used by LLaMA and others); the video explains its different design (code-point-level BPE, byte fallback, many configuration options).

Study: UTF-8 Everywhere Manifesto

Referenced as a recommended reading that explains why UTF-8 is the preferred Web encoding (backwards-compatible with ASCII and widely used).

Person: SolidGoldMagikarp (Reddit user / token anecdote)

A Reddit username that became its own token in some tokenizers and is used as an example of how tokenizer/training-set mismatch can produce untrained token embeddings and undefined LLM behavior.

Design choice: GPT-4 tokenizer whitespace improvements

Design choice in CL100k/GPT‑4 tokenizer to group repeated spaces (improves code density), explicitly shown with Python indentation examples.

Tool: encoder.json & vocab.bpe (OpenAI tokenizer artifacts)

Files used by OpenAI to store the trained tokenizer: encoder (ID→string) and vocab.bpe (merge list). Described as the two items needed to represent a trained tokenizer.
