The spelled-out intro to language modeling: building makemore
Key Moments
From counting bigrams to a neural net: building a character-level language model for names.
Key Insights
A character-level model can learn to generate name-like strings by modeling sequences of characters, starting from a start token and ending with an end token.
A simple bigram (two-character) model can be built by counting occurrences of character pairs and normalizing the counts to probabilities, enabling sampling of new names.
PyTorch tensors and broadcasting enable efficient representation and normalization of bigram counts, plus deterministic sampling with a seeded generator.
Smoothing (adding fake counts) prevents zero probabilities and stabilizes generation; regularization can play a similar stabilizing role in neural nets.
A neural-network extension replaces explicit counts with learned logits, interpreting the logits as log-counts (so their exponentials act as counts) and using softmax to produce probabilities.
Gradient-based training with negative log-likelihood mirrors likelihood-based counting, but scales to longer contexts and more complex architectures (e.g., transformers).
INTRODUCTION AND GOAL: MAKEMORE FOR NAMES
The project begins with a blank Jupyter notebook and a dataset named names.txt, which contains thousands of names (around 32,000 in the example). The goal is to build a language model that operates at the level of characters, treating each line as a sequence of characters that the model must learn to predict. The intro emphasizes starting small and spelling everything out: first implement a character-level model, then extend to word-level and beyond. The intuition is that the model should learn which characters tend to follow which, thereby generating new, plausible name-like sequences.
DATASET AND CHARACTER-LEVEL MODELING
At the core is a dataset of names where each word is treated as a sequence of characters. A special dot token is introduced to mark word boundaries, serving as both start and end token, so the model can learn which character typically begins a name and which character ends it. The 26 lowercase letters plus this special token form a 27-character vocabulary, and the modeling is designed around predicting the next character given the previous one. This sets up the notion of a character-level language model, which can eventually scale to longer contexts and even multi-modal outputs.
INITIAL BIGRAM MODEL AND COUNTING
The first concrete model is a bigram model that looks at only two consecutive characters. By iterating through every word and over every pair of adjacent characters (including the start and end tokens), we count how often each bigram occurs. The counts are stored in a two-dimensional matrix (a 27-by-27 grid once the special dot token is included), where the row indexes the first character and the column the second. This simple counting approach captures local structure and provides a baseline for sampling and evaluation.
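The counting step can be sketched as follows. A small in-line word list stands in for names.txt here, and the single dot token occupies index 0:

```python
# Count bigram occurrences into a 27x27 integer matrix N:
# row = first character, column = second character, index 0 = dot token.
import torch

words = ["emma", "olivia", "ava"]  # stand-in for open('names.txt').read().splitlines()

stoi = {s: i + 1 for i, s in enumerate("abcdefghijklmnopqrstuvwxyz")}
stoi["."] = 0                      # special start/end token
itos = {i: s for s, i in stoi.items()}

N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]  # pad each word with start/end dots
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1
```

Each word of length n contributes n + 1 bigrams because of the two padding dots.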
BUILDING THE PROBABILITY MATRIX AND SAMPLING
The raw counts are converted into probabilities by normalizing each row so that the probabilities of all possible next characters sum to one. A mapping between characters and indices (stoi and itos) is created to enable indexing into the matrix. Sampling begins with the start token (dot) and, at each step, selects the next character according to the corresponding row’s distribution. This yields name-like sequences such as emma, olivia, eva, and so on. Visualization tools help inspect the distribution, revealing which transitions are common or rare.
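A minimal sketch of the normalization and sampling loop, using a tiny hand-made 3x3 count matrix (dot, 'a', 'b') in place of the full 27x27 table:

```python
# Row-normalize counts into probabilities via broadcasting, then sample
# a string with torch.multinomial and a seeded generator.
import torch

itos = {0: ".", 1: "a", 2: "b"}
N = torch.tensor([[0, 3, 1],
                  [2, 0, 2],
                  [1, 1, 0]], dtype=torch.float32)

P = N / N.sum(dim=1, keepdim=True)   # broadcasting: each row now sums to 1

g = torch.Generator().manual_seed(2147483647)
ix = 0                                # start at the dot token
out = []
while True:
    ix = torch.multinomial(P[ix], num_samples=1,
                           replacement=True, generator=g).item()
    if ix == 0:                       # drawing the dot token ends the sample
        break
    out.append(itos[ix])
print("".join(out))
```

The seeded generator makes sampling deterministic across runs.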
VISUALIZATION AND STRUCTURE OF THE COUNTS
To understand the model’s structure, the 27-by-27 counts matrix is visualized. The visualization highlights which first letters are common starters and which transitions follow, while also illustrating the constraints imposed by the start and end tokens. The process also reveals natural zeros (e.g., the end token never begins a word) and shows how the presence of the start and end tokens shapes the learned distribution. This step makes abstract statistics tangible and informs subsequent modeling choices.
PROBABILITIES, LOG-LIKELIHOOD, AND LOSS
Beyond sampling, we evaluate the model by inspecting the probabilities assigned to actual bigrams from the training data. The concept of likelihood (the product of probabilities for all observed bigrams) becomes unwieldy, so the log-likelihood (sum of log probabilities) is used, and the negative log-likelihood serves as a loss to minimize. A rough training loss around 2.4 (average negative log-likelihood) indicates the model is learning meaningful structure. Smoothing is introduced to avoid zeros, which would otherwise produce infinite loss.
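The loss computation can be illustrated with a made-up 2x2 probability table; `xs` and `ys` hold the row/column indices of the observed bigrams:

```python
# Average negative log-likelihood of observed bigrams under a
# probability table P -- a minimal sketch with made-up numbers.
import torch

P = torch.tensor([[0.1, 0.9],
                  [0.5, 0.5]])
xs = torch.tensor([0, 1, 1])   # first character of each observed bigram
ys = torch.tensor([1, 0, 1])   # second character of each observed bigram

log_likelihood = torch.log(P[xs, ys]).sum()
nll = -log_likelihood / len(xs)   # average NLL: lower is better
```

A single zero in P would make the sum -inf, which is exactly what smoothing prevents.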
SMOOTHING AND ITS EQUIVALENCE TO REGULARIZATION
Smoothing adds a small count to every possible bigram, ensuring no probability is zero. This prevents pathological zero-probability transitions and stabilizes generation. The discussion parallels regularization in neural nets: adding a small penalty (like weight decay) nudges the model toward more conservative, smoother predictions. Smoothing is a practical fix for small data or sparse tables, while regularization in neural nets offers a more general mechanism to prevent overfitting and encourage generalizable behavior.
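Add-one smoothing is a one-line change to the normalization, sketched here on a toy 2x2 count matrix:

```python
# Add a fake count of 1 to every cell so no probability is exactly zero,
# which keeps the log-likelihood finite for every observed bigram.
import torch

N = torch.tensor([[5, 0],
                  [0, 3]])
P = (N + 1).float()
P = P / P.sum(dim=1, keepdim=True)   # rows still sum to 1
```

Larger added counts smooth the distribution more aggressively, pulling it toward uniform.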
FROM COUNTS TO NEURAL NETWORKS: ONE-HOT ENCODING AND LOGITS
The model transitions from explicit counts to a neural-network framework. The input character is one-hot encoded, producing a 27-dimensional vector that feeds into a simple neural net: a single linear layer mapping 27 inputs to 27 outputs (logits). These logits are interpreted as log-counts; exponentiation yields counts, and normalization (softmax) converts them into probabilities for the next character. This reframing keeps the same probabilistic interpretation while enabling gradient-based optimization and the ability to scale to longer contexts.
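A sketch of this reframing, with a randomly initialized 27x27 weight matrix standing in for a trained one:

```python
# One-hot inputs through a single linear layer; logits are read as
# log-counts, exponentiated into counts, and normalized (softmax).
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)        # the net's only parameters

xs = torch.tensor([0, 5, 13])                 # example input indices
xenc = F.one_hot(xs, num_classes=27).float()  # (3, 27) one-hot rows

logits = xenc @ W                             # (3, 27) "log-counts"
counts = logits.exp()                         # analogue of the count matrix
probs = counts / counts.sum(dim=1, keepdim=True)  # softmax: rows sum to 1
```

Because the input is one-hot, `xenc @ W` simply plucks out rows of W, mirroring the lookup into the count table.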
TRAINING THE NEURAL NET: LOSS, BACKPROPAGATION, AND OPTIMIZATION
Training proceeds with a forward pass producing a 5-by-27 probability matrix for the five examples from a word like emma. The targets are the actual next characters, and the loss is the mean of the negative log-likelihood across examples. PyTorch autograd is employed: requires_grad is set, loss.backward() computes gradients, and a simple gradient-descent step updates the 27-by-27 weight matrix. The demonstration shows loss decreasing over iterations, aligning the neural net’s probabilities with the training data’s next-character distribution.
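A minimal training loop along these lines, using the five bigrams of ".emma." as toy data (indices assume a=1, ..., z=26 with the dot at 0):

```python
# Forward pass, NLL loss, backward pass, and a plain gradient-descent
# update on the 27x27 weight matrix.
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

# toy training set: the five bigrams of ".emma." (dot=0, a=1, e=5, m=13)
xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])

losses = []
for step in range(100):
    # forward pass: one-hot -> logits -> softmax probabilities
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(dim=1, keepdim=True)
    loss = -probs[torch.arange(len(xs)), ys].log().mean()
    losses.append(loss.item())

    # backward pass and gradient-descent step
    W.grad = None
    loss.backward()
    W.data += -10.0 * W.grad
```

The loss falls steadily as the net's probabilities align with the observed next-character distribution.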
EVALUATION: COMPARING COUNTING AND GRADIENT-BASED LEARNING
Two learning paths—count-based normalization and gradient-based optimization—converge to essentially the same model. The count-based approach yields a direct estimate of bigram probabilities, while the neural-net approach learns logits that, after exponentiation and normalization, reproduce similar probabilities. The takeaway is that both methods optimize the same objective: predicting the next character. The neural-net route offers greater scalability to longer contexts and more sophisticated architectures, paving the way toward transformers and more powerful language models.
FUTURE DIRECTIONS: SCALING CONTEXTS AND TRANSFORMERS
The concluding sections point toward extending beyond single-character context to longer histories (e.g., multiple preceding characters) and toward deeper neural networks. The framework remains the same: logits feed into a softmax to produce probabilities, loss guides optimization, and gradients propagate through the network. The path forward envisions richer architectures, regularization strategies, and ultimately transformer-based models capable of handling extensive context, more complex dependencies, and multi-modal outputs like image-text systems, while preserving the core probabilistic training mindset.
Common Questions
makemore is a character-level language model repo that generates more items like those in the input dataset; the example uses names.txt (~32K names), where each line is a training example.
Mentioned in this video
●names.txt — example dataset of around 32,000 names used to train makemore; each line is treated as an example sequence of characters.
●Multimodal image-text models — mentioned as examples to explore after character- and word-level language models.
●PyTorch — deep learning framework used to store counts in tensors, perform matrix operations, sample with torch.multinomial, and run training/backprop.
●torch.multinomial — PyTorch function used to sample indices from a multinomial distribution (draw the next character according to its probabilities).
●makemore — the repository/project being built: a character-level language model that "makes more" items like the names in a dataset.
●Softmax — the operation that exponentiates logits and normalizes them into a probability distribution over next characters.
●An image-text diffusion model — cited as a potential future topic (image + text networks) after finishing character-level modeling.
More from Andrej Karpathy
●How I use LLMs (132 min)
●Deep Dive into LLMs like ChatGPT (212 min)
●Let's reproduce GPT-2 (124M) (242 min)
●Let's build the GPT Tokenizer (134 min)