Let's build the GPT Tokenizer

Andrej Karpathy
Science & Technology · 134 min video
Feb 20, 2024

TL;DR

Tokenizer deep dive: from UTF-8 bytes to BPE vocab, GPT quirks, and tips.

Key Insights

1. Tokenization is the hidden engine behind most LLM quirks; small design choices can ripple through performance and behavior.

2. Byte-pair encoding (BPE) compresses text by iteratively merging the most frequent adjacent token pair, growing the vocabulary with each merge.

3. UTF-8 bytes provide a stable basis for tokenization, but naive byte-level schemes create prohibitively long sequences; BPE densifies the input.

4. Special tokens (end-of-text, start-of-sequence, and model-specific tokens) shape training and inference, often requiring embedding and architecture adjustments.

5. Tokenizer design varies across models (GPT-2 vs. GPT-4 vs. Llama) and tools (tiktoken vs. SentencePiece), each with trade-offs in efficiency, case handling, and unknown tokens.

6. Tokenization quality influences multilingual support, code handling (e.g., Python indentation), arithmetic tasks, and overall context-length utilization.

INTRODUCTION TO TOKENIZATION

The video opens with tokenization framed as a necessary but hairy part of modern language models. Tokenization converts strings into sequences of tokens that the model can process. Early work used naive character-level tokenization on a Shakespeare dataset with a fixed 65-character vocabulary, showing how tokens map to embeddings and feed into Transformer stacks. The speaker emphasizes how small tokenization quirks—like where spaces or punctuation land in token sequences—can create a cascade of behavioral idiosyncrasies in LLMs, underscoring why tokenization deserves careful study.

FROM BYTES TO TOKENS: UTF-8, UNICODE, AND THE NEED FOR A VOCABULARY

The talk moves from Unicode code points to practical encoding. Python's ord() function reveals code points (e.g., 104 for 'h'), but the Unicode space is vast (~150,000 characters), making direct code-point indexing impractical. UTF-8 offers a compact, variable-length encoding (1–4 bytes per code point) and backward compatibility with ASCII, making it a preferred base for text pipelines. However, raw UTF-8 bytes produce long sequences; thus we need a smarter tokenization approach that preserves context while keeping the vocabulary manageable and efficient for Transformers.
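
The code point vs. byte distinction above can be seen directly in the standard library (a minimal illustration, not code from the lecture):

```python
# Unicode code points vs. UTF-8 bytes.
text = "hello 안녕"

# ord() exposes the Unicode code point of a single character.
print(ord("h"))  # 104

# UTF-8 is variable-length: ASCII stays at 1 byte per character,
# while each Hangul syllable here takes 3 bytes.
raw = text.encode("utf-8")
print(len(text))  # 8 characters
print(len(raw))   # 12 bytes: 6 ASCII + 2 * 3-byte Hangul
```

This is exactly why raw byte-level tokenization inflates sequence lengths, especially for non-ASCII text.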

BPE AS THE CORE OF MODERN TOKENIZATION

Byte Pair Encoding (BPE) is introduced as the primary mechanism to compress the UTF-8 byte stream into a useful token vocabulary. The algorithm iteratively finds the most frequent adjacent pair of tokens and replaces it with a new token, expanding the vocabulary (e.g., from 256 raw bytes to additional merge tokens). This creates a compact representation where common patterns (like 'e ' or common word parts) become single tokens. The result is a shorter token sequence with richer merges, enabling more text to fit into a fixed context window.
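
The core merge step described above fits in a few lines of plain Python (a toy sketch, not the production tokenizer; function names are illustrative):

```python
from collections import Counter

def get_stats(ids):
    """Count how often each adjacent pair of tokens occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # start from raw bytes
top = get_stats(ids).most_common(1)[0][0]  # most frequent adjacent pair
ids = merge(ids, top, 256)                 # first new token beyond the 256 bytes
```

Here the most frequent pair is two 'a' bytes (97, 97), and merging it shortens the sequence from 11 tokens to 9 while growing the vocabulary by one.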

TRAINING A TOKENIZER: MERGES, VOCAB, AND HYPERPARAMETERS

Training a tokenizer involves choosing a final vocabulary size and performing a sequence of merges to reach that size. The example starts with 256 initial raw-byte tokens and performs a number of merges (e.g., 20) to create larger tokens, building a forest-like structure of merges rather than a single tree. This process yields compression (e.g., from 24,000 bytes to ~19,000 tokens in a sample) and a density that depends on the chosen vocabulary target. Importantly, the tokenizer is a separate preprocessing stage with its own training data and objectives, impacting multilingual behavior and code handling.
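
The training loop sketched above is just the merge step repeated until the target vocabulary size is reached (a self-contained toy version; names and sizes are illustrative):

```python
from collections import Counter

def get_stats(ids):
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id); i += 2
        else:
            out.append(ids[i]); i += 1
    return out

def train(text, vocab_size):
    """Toy BPE trainer: perform (vocab_size - 256) merges over raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (pair) -> new token id, recorded in insertion order
    for new_id in range(256, vocab_size):
        stats = get_stats(ids)
        if not stats:
            break
        pair = stats.most_common(1)[0][0]
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

ids, merges = train("aaabdaaabac" * 10, 260)  # vocab 260 = 256 bytes + 4 merges
```

Each learned merge can itself contain earlier merge tokens, which is why the result is a forest of merges rather than a flat list of substrings.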

ENCODING AND DECODING: HANDLING BPE MERGES AND UTF-8 ERRORS

Encoding selects tokens from the learned merges and converts text to a token sequence, while decoding reassembles bytes into text. The implementation must carefully map IDs back to bytes and then decode with UTF-8, using strict or forgiving error handling. Real-world issues arise when some token sequences don’t decode cleanly (invalid UTF-8), necessitating errors='replace' to avoid crashes. The decoding path must also respect the insertion order of merges, and must consider the possibility of rare or unknown tokens, which can propagate into the model as untrained embedding vectors.
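
Both directions can be sketched as follows, assuming a `vocab` dict mapping id to bytes and an insertion-ordered `merges` dict (a simplified version; the lecture's actual encode repeatedly picks the earliest-learned applicable pair):

```python
def decode(ids, vocab):
    """Map token ids back to bytes, then UTF-8-decode forgivingly."""
    raw = b"".join(vocab[i] for i in ids)
    # errors="replace" emits U+FFFD instead of raising on invalid UTF-8.
    return raw.decode("utf-8", errors="replace")

def encode(text, merges):
    """Apply merges in the order they were learned (dicts preserve order)."""
    ids = list(text.encode("utf-8"))
    for (a, b), new_id in merges.items():
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == (a, b):
                out.append(new_id); i += 2
            else:
                out.append(ids[i]); i += 1
        ids = out
    return ids

# vocab: the 256 raw bytes plus any learned merges (none in this sketch).
vocab = {i: bytes([i]) for i in range(256)}
print(decode([128], vocab))  # lone continuation byte -> '\ufffd', not a crash
```

Byte 128 alone is not valid UTF-8, which is exactly the case that forces the forgiving error handler.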

GPT-2, GPT-4, AND LLAMA: TOKENIZATION AT SCALE

The speaker surveys how tokenization scales across models. GPT-2 used a 50,257-token vocabulary with a 1,024-token context, while GPT-4's tokenizer expands the vocabulary to roughly 100,000 tokens and tokenizes Python far more densely. Token-density improvements (e.g., grouping multiple spaces into a single token) help preserve code structure and extend usable context. Llama shows how non-English text tends to inflate token counts, because the tokenizer splits it into smaller pieces, which strains attention budgets and sequence length in practice.

SPECIAL TOKENS AND REGEX PATTERNS

Special tokens such as end-of-text or start-of-sequence are discussed as essential scaffolding for training and downstream tasks like chat and document delimitation. The GPT-2 tokenizer uses a single end-of-text token, while GPT-4 and friends add more tokens (start, end, prefix, suffix, etc.) to encode conversations and tool usage. In practice, tokenizers may enforce rules that avoid certain merges (via regex or custom logic), preserving semantic boundaries (e.g., not merging letters with adjacent punctuation) and ensuring predictable decode-ability and model behavior.
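The kind of enforced split described above can be sketched with a simplified, ASCII-only pattern using the standard library (the real GPT-2 pattern uses the third-party `regex` module with Unicode classes like \p{L}; this is only an approximation of the idea):

```python
import re

# Simplified stand-in for GPT-2-style pre-tokenization: text is split into
# chunks before BPE runs, so merges can never cross a chunk boundary
# (e.g., a letter never fuses with adjacent punctuation).
PAT = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

chunks = PAT.findall("Hello world!!!")
print(chunks)  # ['Hello', ' world', '!!!']
```

BPE then runs independently inside each chunk, which keeps token boundaries semantically predictable.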

TOOLING: TIKTOKEN AND SENTENCEPIECE

The talk compares tokenization tools. tiktoken is OpenAI's inference-focused tokenization library (fast, with a Rust core, and aligned with the GPT-2 and GPT-4 tokenizers), while SentencePiece is a training-plus-inference tool used by Llama and other models, offering several training algorithms and a robust, though sometimes convoluted, configuration space. tiktoken highlights the preprocessing and special-token handling, whereas SentencePiece emphasizes a broader training regime (including byte fallback in some modes). The speaker notes caveats about defaults, debugging complexity, and the trade-offs between these ecosystems.

PRACTICAL CHALLENGES: NON-ENGLISH LANGUAGES, CODE, AND CONTEXT LENGTH

Tokenization directly shapes multilingual performance and code handling. English tends to yield longer, denser tokens; non-English languages often incur more tokens for the same content, bloating context requirements. Python code illustrates a critical inefficiency: in older tokenizers, each indentation space frequently became its own token, wasting context length. The talk emphasizes how tokenizer choices influence arithmetic tasks, spelling, and even logic tasks, and shows how GPT-4's tokenizer addressed the Python problem by grouping runs of whitespace into single tokens.
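
The whitespace effect can be made concrete with a toy count (an illustration of the idea, not GPT-4's actual merge rules):

```python
import re

code = "def f():\n        return 1\n"  # 8-space indent

# Naive byte-level view: every one of the 8 indent spaces is its own token.
naive = len(code.encode("utf-8"))

# GPT-4-style improvement (sketched): collapse each run of spaces into a
# single token, leaving every other byte as its own token.
grouped = len(re.findall(r" +|[^ ]", code))

print(naive, grouped)  # 26 vs. 19 tokens for this tiny snippet
```

On deeply indented real code, the savings compound quickly, which is why this change matters for context-length utilization.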

DESIGN SPACE: PROMPT COMPRESSION, SPECIAL TOKENS, AND MODEL SURGERY

Beyond basic tokenization, models explore design-space options like prompt compression (gist tokens) and adding custom tokens that require minimal model surgery (embedding and output layer resizing) to support new functions. The speaker highlights strategies to inject new behavior without re-training the whole model, enabling efficient fine-tuning. This section also touches on multimodal tokenization approaches and the possibility of tokenizing images or other modalities into the same Transformer framework, leveraging existing tokenization pipelines.
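
The "minimal model surgery" mentioned above amounts to appending one freshly initialized row to the embedding table (and, symmetrically, to the output projection). A pure-Python stand-in with illustrative sizes, in place of what a framework like PyTorch would do with real tensors:

```python
import random

random.seed(0)
d_model, vocab_size = 4, 8  # toy dimensions, for illustration only

# Trained token-embedding table: one row per token id.
embedding = [[random.gauss(0, 0.02) for _ in range(d_model)]
             for _ in range(vocab_size)]

# Add one new special token: its id is the next free slot, and its row is
# freshly initialized while every trained row is left untouched.
new_token_id = vocab_size
embedding.append([random.gauss(0, 0.02) for _ in range(d_model)])
vocab_size += 1
```

Only the new row (and its output-layer mirror) needs training, which is what makes fine-tuning a handful of custom tokens cheap relative to retraining the model.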

FUTURE DIRECTIONS: TOKENIZATION-FREE IDEAS AND PRACTICAL ADVICE

The talk surveys ambitious ideas like tokenization-free autoregressive modeling, hierarchical Transformers, and fully end-to-end raw byte streams. While tantalizing, these ideas remain largely experimental. In the meantime, pragmatic guidance favors reusing the established GPT-4 vocabulary with a high-quality tokenizer (tiktoken) for inference, and carefully choosing trainer/tokenizer combinations for new experiments. The speaker closes by acknowledging the complexity and practical necessity of tokenization, while outlining a path toward more robust or even tokenization-free architectures in the distant future.

Common Questions

What does tokenization actually do, and why does it matter?

Tokenization converts text (Unicode code points) into discrete token IDs that an LLM can embed and process. Many model behaviors (spelling, non-English performance, code handling) trace back to how text is chunked into tokens.


Mentioned in this video

Study: GPT-2 paper

Referenced as the paper that introduced byte-level tokenization for large language models and motivated many tokenizer design decisions (vocab size, properties).

Study: LLaMA 2 paper

Mentioned to show how tokens are pervasive in model descriptions and to motivate tokenizer training/coverage decisions (e.g., tokens trained on large corpora).

Tool: cl100k_base (GPT-4 tokenizer)

The tokenizer used for GPT-4 (a ~100k token vocabulary). The video compares it to GPT-2 tokenizer and shows improvements in whitespace/coding efficiency.

Tool: tiktoken library

Official OpenAI tokenization library (in Rust) for inference with GPT tokenizers; discussed for usage and special-token handling.

Tool: minbpe (repo / implementation)

Author's repository and exercises for training a GPT‑4-like tokenizer from scratch (includes training code, visualization of merges, and tests).

Study: Integer Tokenization is Insane (blog post)

A blog post investigating how numbers are tokenized (variable tokenization for integers) and its impact on arithmetic performance.

Tool: fill-in-the-middle (FIM) special tokens

Special tokens (prefix/middle/suffix) used to encode fill-in-the-middle tasks; briefly mentioned as part of GPT-4 tokenizer special tokens.

Tool: tiktokenizer (web app)

A live web demo used in the video to visualize tokenization (GPT-2 vs. cl100k_base/GPT-4 tokenizers).

Tool: OpenAI encoder.py (GPT-2 encoder implementation)

The inference implementation released by OpenAI that applies saved merges and vocabulary — the lecture walks through how it performs BPE at inference time.

Tool: SentencePiece

Google's library for both tokenizer training and inference (used by LLaMA and others); the video explains its different design (code-point-level BPE, byte fallback, many configuration options).

Study: UTF-8 Everywhere Manifesto

Referenced as a recommended reading that explains why UTF-8 is the preferred Web encoding (backwards-compatible with ASCII and widely used).

Person: SolidGoldMagikarp (Reddit user / token anecdote)

A Reddit username that became its own token in some tokenizers and is used as an example of how tokenizer/training-set mismatch can produce untrained token embeddings and undefined LLM behavior.

Design choice: GPT-4 tokenizer whitespace improvements

Design choice in CL100k/GPT‑4 tokenizer to group repeated spaces (improves code density), explicitly shown with Python indentation examples.

Tool: encoder.json & vocab.bpe (OpenAI tokenizer artifacts)

Files used by OpenAI to store the trained tokenizer: encoder (ID→string) and vocab.bpe (merge list). Described as the two items needed to represent a trained tokenizer.
