The spelled-out intro to language modeling: building makemore

Andrej Karpathy
Science & Technology · 5 min read · 118 min video
Sep 7, 2022 · 1,099,670 views


TL;DR

From counting bigrams to a neural net: building a character-level language model for names.

Key Insights

1. A character-level model can learn to generate name-like strings by modeling sequences of characters, starting from a start token and ending with an end token.

2. A simple bigram (two-character) model can be built by counting occurrences of character pairs and normalizing to probabilities, enabling sampling of new names.

3. PyTorch tensors and broadcasting enable efficient representation and normalization of bigram counts, plus deterministic sampling with a seeded generator.

4. Smoothing (adding fake counts) prevents zero probabilities and stabilizes generation; regularization can play a similar stabilizing role in neural nets.

5. A neural-network extension replaces explicit counts with learned logits: the logits are interpreted as log-counts, exponentiating them yields counts, and softmax normalization produces probabilities.

6. Gradient-based training with negative log-likelihood mirrors likelihood-based counting, but scales to longer contexts and more complex architectures (e.g., transformers).

INTRODUCTION AND GOAL: MAKEMORE FOR NAMES

The project begins with a blank Jupyter notebook and a dataset named names.txt, which contains thousands of names (around 32,000 in the example). The goal is to build a language model that operates at the level of characters, treating each line as a sequence of characters that the model must learn to predict. The intro emphasizes starting small and spelling everything out: first implement a character-level model, then extend to word-level and beyond. The intuition is that the model should learn which characters tend to follow which, thereby generating new, plausible name-like sequences.

DATASET AND CHARACTER-LEVEL MODELING

At the core is a dataset of names where each word is treated as a sequence of characters. We introduce a special dot token that marks both the start and the end of a word, so the model can learn which character typically begins a name and which character ends it. The 26 lowercase letters plus the dot form a 27-token vocabulary, and the modeling is designed around predicting the next character given the previous one. This sets up the notion of a character-level language model, which can eventually scale to longer contexts and even multi-modal outputs.
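A minimal sketch of this vocabulary setup (a tiny inline word list stands in for names.txt, so the alphabet here is only the letters those words contain):

```python
# Minimal sketch of the vocabulary setup; a tiny inline list stands in
# for the ~32,000 names in names.txt.
words = ["emma", "olivia", "ava"]

# Sorted unique characters; with the full file this is the 26 letters.
chars = sorted(set("".join(words)))

# '.' takes index 0 and serves as both the start and the end token.
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}
```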

INITIAL BIGRAM MODEL AND COUNTING

The first concrete model is a bigram model that looks at only two consecutive characters. By iterating through every word and over every pair of adjacent characters (including the start and end tokens), we count how often each bigram occurs. The counts are stored in a two-dimensional matrix (a 27-by-27 grid once the dot token is included), where the row is the first character and the column is the second. This simple counting approach captures local structure and provides a baseline for sampling and evaluation.
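The counting loop can be sketched as follows (a toy word list stands in for names.txt, and the mapping is built inline so the block is self-contained):

```python
import torch

words = ["emma", "olivia", "ava"]  # stand-in for the full names.txt
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0

# One row/column per token; 27x27 with the real vocabulary.
n = len(stoi)
N = torch.zeros((n, n), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]        # wrap each word in dot tokens
    for ch1, ch2 in zip(chs, chs[1:]):   # every adjacent character pair
        N[stoi[ch1], stoi[ch2]] += 1
```

Each word of length L contributes L + 1 bigrams once the start and end dots are included.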

BUILDING THE PROBABILITY MATRIX AND SAMPLING

The raw counts are converted into probabilities by normalizing each row so that the probabilities of all possible next characters sum to one. A mapping between characters and indices (stoi and itos) is created to enable indexing into the matrix. Sampling begins with the start token (dot) and, at each step, selects the next character according to the corresponding row’s distribution, stopping when the dot is drawn again. Because the model sees only one character of context, the generated strings are name-like but often garbled. Visualization tools help inspect the distribution, revealing which transitions are common or rare.
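Normalization and sampling might look like this (a toy 3x3 count table keeps the block self-contained; row/column 0 plays the role of the dot token):

```python
import torch

# Toy 3x3 count table; row/column 0 plays the role of the '.' token.
N = torch.tensor([[0., 2., 2.],
                  [1., 0., 3.],
                  [4., 0., 0.]])
P = N / N.sum(dim=1, keepdim=True)  # normalize each row to a distribution

g = torch.Generator().manual_seed(2147483647)  # seeded for reproducibility
ix = 0                                          # start at the '.' row
out = []
while True:
    ix = torch.multinomial(P[ix], num_samples=1,
                           replacement=True, generator=g).item()
    if ix == 0:                                 # drew the end token
        break
    out.append(ix)
```

The seeded generator makes the sampled sequence reproducible across runs.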

VISUALIZATION AND STRUCTURE OF THE COUNTS

To understand the model’s structure, the 27-by-27 counts matrix is visualized. The visualization highlights which first letters are common starters and which transitions follow, while also illustrating the constraints imposed by the start and end tokens. The process also reveals natural zeros (e.g., the end token cannot begin a word) and how the presence of the start and end tokens shapes the learned distribution. This step makes abstract statistics tangible and informs subsequent modeling choices.

PROBABILITIES, LOG-LIKELIHOOD, AND LOSS

Beyond sampling, we evaluate the model by inspecting the probabilities assigned to actual bigrams from the training data. The concept of likelihood (the product of probabilities for all observed bigrams) becomes unwieldy, so the log-likelihood (sum of log probabilities) is used, and the negative log-likelihood serves as a loss to minimize. A rough training loss around 2.4 (average negative log-likelihood) indicates the model is learning meaningful structure. Smoothing is introduced to avoid zeros, which would otherwise produce infinite loss.
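The loss computation can be sketched with a toy probability table (P, xs, and ys here are illustrative stand-ins for the real 27x27 matrix and the dataset's bigrams):

```python
import torch

# Toy row-normalized probability table and observed bigrams.
P = torch.tensor([[0.1, 0.9],
                  [0.5, 0.5]])
xs = torch.tensor([0, 1, 1])  # previous-character indices
ys = torch.tensor([1, 0, 1])  # observed next-character indices

# The likelihood is the product of P[x, y] over all bigrams; working in
# log space turns the product into a sum and avoids underflow.
log_likelihood = torch.log(P[xs, ys]).sum()
nll = -log_likelihood / xs.numel()  # average negative log-likelihood
```

A lower average NLL means the model assigns higher probability to the observed next characters.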

SMOOTHING AND ITS EQUIVALENCE TO REGULARIZATION

Smoothing adds a small count to every possible bigram, ensuring no probability is zero. This prevents pathological zero-probability transitions and stabilizes generation. The discussion parallels regularization in neural nets: adding a small penalty (like weight decay) nudges the model toward more conservative, smoother predictions. Smoothing is a practical fix for small data or sparse tables, while regularization in neural nets offers a more general mechanism to prevent overfitting and encourage generalizable behavior.
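Add-one smoothing is a one-line change to the normalization (toy 2x2 counts for illustration):

```python
import torch

# Toy counts whose zeros would otherwise yield infinite loss.
N = torch.tensor([[0., 5.],
                  [3., 0.]])

# Add a fake count of 1 to every cell before normalizing, so every
# transition gets a small nonzero probability.
P = (N + 1) / (N + 1).sum(dim=1, keepdim=True)
```

Larger fake counts pull every row closer to the uniform distribution, which is the counting-model analogue of stronger regularization.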

FROM COUNTS TO NEURAL NETWORKS: ONE-HOT ENCODING AND LOGITS

The model transitions from explicit counts to a neural-network framework. The input character is one-hot encoded, producing a 27-dimensional vector that feeds into a simple neural net: a single linear layer mapping 27 inputs to 27 outputs (logits). These logits are interpreted as log-counts; exponentiation yields counts, and normalization (softmax) converts them into probabilities for the next character. This reframing keeps the same probabilistic interpretation while enabling gradient-based optimization and the ability to scale to longer contexts.
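A sketch of this forward pass (W is randomly initialized here; 27 matches the vocabulary size, and the input indices are arbitrary examples):

```python
import torch
import torch.nn.functional as F

vocab_size = 27
xs = torch.tensor([0, 5, 13])  # example previous-character indices

# One-hot encode; cast to float so it can enter a matrix multiply.
xenc = F.one_hot(xs, num_classes=vocab_size).float()

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((vocab_size, vocab_size), generator=g)

logits = xenc @ W                     # interpreted as log-counts
counts = logits.exp()                 # analogue of the count matrix N
probs = counts / counts.sum(dim=1, keepdim=True)  # softmax
```

The last two lines are exactly a softmax, written out to mirror the counting model's "counts, then normalize" structure.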

TRAINING THE NEURAL NET: LOSS, BACKPROPAGATION, AND OPTIMIZATION

Training proceeds with a forward pass producing a 5-by-27 probability matrix for the five examples from a word like emma. The targets are the actual next characters, and the loss is the mean of the negative log-likelihood across examples. PyTorch autograd is employed: requires_grad is set, loss.backward() computes gradients, and a simple gradient-descent step updates the 27-by-27 weight matrix. The demonstration shows loss decreasing over iterations, aligning the neural net’s probabilities with the training data’s next-character distribution.
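The training loop can be sketched on the five bigrams of "emma" (a toy 4-token vocabulary is assumed here, with . = 0, a = 1, e = 2, m = 3; the learning rate and iteration count are illustrative):

```python
import torch
import torch.nn.functional as F

# Bigrams of ".emma.": (.,e) (e,m) (m,m) (m,a) (a,.)
# Toy indices: . = 0, a = 1, e = 2, m = 3
xs = torch.tensor([0, 2, 3, 3, 1])
ys = torch.tensor([2, 3, 3, 1, 0])

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((4, 4), generator=g, requires_grad=True)

losses = []
for _ in range(200):
    # forward pass: one-hot -> logits -> softmax -> average NLL
    xenc = F.one_hot(xs, num_classes=4).float()
    logits = xenc @ W
    probs = logits.exp() / logits.exp().sum(dim=1, keepdim=True)
    loss = -probs[torch.arange(5), ys].log().mean()
    losses.append(loss.item())

    # backward pass and gradient-descent update
    W.grad = None
    loss.backward()
    W.data += -1.0 * W.grad
```

Note the loss cannot reach zero: the character m is followed by both m and a in the data, so some probability mass must be split.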

EVALUATION: COMPARING COUNTING AND GRADIENT-BASED LEARNING

Two learning paths—count-based normalization and gradient-based optimization—converge to essentially the same model. The count-based approach yields a direct estimate of bigram probabilities, while the neural-net approach learns logits that, after exponentiation and normalization, reproduce similar probabilities. The takeaway is that both methods optimize the same objective: predicting the next character. The neural-net route offers greater scalability to longer contexts and more sophisticated architectures, paving the way toward transformers and more powerful language models.
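One way to see why the two paths converge: multiplying a one-hot vector by W simply selects a row of W, so each row of W ends up playing the role of one row of log-counts in the table-based model. A quick check of that equivalence:

```python
import torch
import torch.nn.functional as F

W = torch.randn(27, 27)
xs = torch.tensor([0, 5, 13])
xenc = F.one_hot(xs, num_classes=27).float()

# One-hot matmul is just row selection: xenc @ W picks rows xs of W.
selected = xenc @ W
```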

FUTURE DIRECTIONS: SCALING CONTEXTS AND TRANSFORMERS

The concluding sections point toward extending beyond single-character context to longer histories (e.g., multiple preceding characters) and toward deeper neural networks. The framework remains the same: logits feed into a softmax to produce probabilities, loss guides optimization, and gradients propagate through the network. The path forward envisions richer architectures, regularization strategies, and ultimately transformer-based models capable of handling extensive context, more complex dependencies, and multi-modal outputs like image-text systems, while preserving the core probabilistic training mindset.

Bigram character-level LM — quick cheat sheet

Practical takeaways from this episode

Do This

Do prepend a single start token (.) and append an end token when extracting bigrams from each line.
Do store bigram counts in an N×N matrix (N = vocab size + specials) and normalize rows to get conditional probabilities.
Do use torch.multinomial with a fixed torch.Generator(seed) for deterministic sampling during demos.
Do compute loss as the average negative log-likelihood (NLL) over the dataset for training and evaluation.
Do use in-place operations and correct keepdim semantics when normalizing tensors to avoid silent broadcasting bugs.

Avoid This

Don't forget to convert integer label tensors to float where required (e.g., one-hot → float) before matrix ops.
Don't rely on implicit broadcasting shapes — use keepdim=True when summing rows to ensure division broadcasts the correct way.
Don't sample with replacement=False when you intend to sample multiple times from a distribution; set replacement=True.
Don't expect a simple bigram model to produce fluent long strings — it ignores longer context.
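The keepdim pitfall above can be demonstrated directly (toy 3x3 matrix):

```python
import torch

N = torch.arange(1., 10.).reshape(3, 3)

# Correct: the sum keeps shape (3, 1), so division normalizes each row.
P_good = N / N.sum(dim=1, keepdim=True)

# Buggy: the sum has shape (3,), which broadcasts as (1, 3) and silently
# divides each COLUMN by a different row-sum instead.
P_bad = N / N.sum(dim=1)
```

Both versions run without error, which is exactly why the bug is easy to miss.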

Common Questions

MakeMore is a character-level language model repo that generates more items like those in the input dataset; the example uses names.txt (~32K names) where each line is a training example. (see 29s)

