Building makemore Part 2: MLP
Key Moments
MLP char model with embeddings beats bigram; training tricks and sampling.
Key Insights
Context explosion in higher-order models motivates embedding-based generalization instead of enumerating all contexts.
Bengio et al. 2003 show embedding vectors (for words in their work) allow a neural net to predict for unseen contexts by transferring related information.
Efficient embedding lookup in PyTorch (using a shared embedding matrix) enables scalable character-level models with varying context lengths.
Cross-entropy loss with softmax is preferred over manual probability calculations for numerical stability and performance.
Split data into training, development (validation), and test sets to evaluate hyperparameters and prevent overfitting.
Sampling after training demonstrates the model generating more name-like text, revealing progress beyond simple bigrams.
CONTEXT AND MOTIVATION
The lesson begins by contrasting the simplistic bigram model, which uses a single previous character to predict the next one, with the need for longer context. Relying on one-step history produces poor, barely name-like output, but naively expanding the context makes the context space grow exponentially (27 possible characters to the power of N), so the count-based approach becomes impractical. A multi-layer perceptron (MLP) approach inspired by Bengio et al. 2003 addresses this by introducing embeddings that map tokens to dense vectors, enabling generalization and transfer of knowledge across similar contexts.
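The context-explosion arithmetic is easy to check directly. This tiny sketch counts the possible contexts for a 27-character vocabulary (26 letters plus the '.' boundary token) at a few context lengths:

```python
# A count-based model must tabulate every possible context; over a
# 27-character vocabulary the table grows as 27**N with context length N.
vocab_size = 27
for n in range(1, 5):
    print(f"context length {n}: {vocab_size ** n:,} possible contexts")
```

Already at three characters of context there are nearly 20,000 rows, most of which would have few or no observed counts.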
EMBEDDINGS AND NETWORK ARCHITECTURE
In the Bengio-style model, every word (or character) is assigned a low-dimensional embedding (for words this was about 30 dimensions; for characters in our adaptation, a 27-character vocabulary is embedded). The embedding table C maps each token index to its vector. Three embedded inputs are concatenated and fed into a hidden layer of size H, followed by a linear output layer producing logits for the 27 possible next characters. A softmax converts logits to probabilities, and backprop updates the embeddings, hidden weights, and output weights.
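The architecture described above can be sketched in a few lines of PyTorch. The specific sizes here (10-dimensional embeddings, 200 hidden units, block size 3) are illustrative assumptions, not the only settings used in the lecture:

```python
import torch

g = torch.Generator().manual_seed(42)                 # illustrative seed
vocab_size, emb_dim, block_size, hidden = 27, 10, 3, 200  # assumed sizes

C  = torch.randn((vocab_size, emb_dim), generator=g)           # embedding table
W1 = torch.randn((block_size * emb_dim, hidden), generator=g)  # hidden layer
b1 = torch.randn(hidden, generator=g)
W2 = torch.randn((hidden, vocab_size), generator=g)            # output layer
b2 = torch.randn(vocab_size, generator=g)

def forward(X):
    """X: (N, block_size) integer tensor of character indices -> logits."""
    emb = C[X]                                          # (N, block_size, emb_dim)
    h = torch.tanh(emb.view(-1, block_size * emb_dim) @ W1 + b1)  # (N, hidden)
    return h @ W2 + b2                                  # (N, vocab_size) logits
```

Applying a softmax to the returned logits gives the probability distribution over the 27 possible next characters.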
DATA PREPARATION AND BLOCKS
The dataset is prepared with a fixed context length, block size, which is the number of preceding characters used to predict the next one. Here, a block size of three characters is used. Context is padded with dots as needed. The code builds X (contexts) and Y (targets) by sliding a window across the text. In development, a tiny example (Emma) illustrates how a fixed context yields multiple input-output pairs. Ultimately, the full training set contains hundreds of thousands of examples, enabling substantial learning.
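The sliding-window construction might look like the following sketch; the three example words and the `stoi` mapping are stand-ins for the real names dataset:

```python
import torch

words = ["emma", "olivia", "ava"]          # placeholder for the names dataset
chars = sorted(set("".join(words)))
stoi = {c: i + 1 for i, c in enumerate(chars)}
stoi["."] = 0                              # '.' pads the context and marks word ends
block_size = 3                             # number of preceding characters used

X, Y = [], []
for w in words:
    context = [0] * block_size             # start every word from '...'
    for ch in w + ".":
        X.append(context)                  # current window is the input
        Y.append(stoi[ch])                 # next character is the target
        context = context[1:] + [stoi[ch]] # slide the window forward
X, Y = torch.tensor(X), torch.tensor(Y)
```

Each word of length L yields L + 1 (context, target) pairs, the last one predicting the terminating '.'.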
IMPLEMENTATION DETAILS IN PYTORCH
Key implementation points include the embedding lookup: embedding vectors are retrieved with C[X], where X is a tensor of indices into the embedding table C. This replaces one-hot encoding followed by a matrix multiply and makes the model scalable. To feed three embeddings into a single linear layer, the (N, 3, D) embedding tensor is viewed as a flat (N, 3·D) matrix, avoiding an expensive concatenation. The hidden layer computes a nonlinear activation (tanh in the demonstration) before projecting to 27 logits. Cross-entropy loss over a softmax output provides stable, efficient training.
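A small sketch of the lookup-and-view trick; the sizes here are assumptions for illustration:

```python
import torch

g = torch.Generator().manual_seed(0)
C = torch.randn((27, 10), generator=g)     # 27 chars, 10-dim embeddings (assumed)
X = torch.randint(0, 27, (4, 3))           # 4 examples, 3-character contexts

# Indexing the table is mathematically equivalent to one-hot @ C, but far cheaper:
emb = C[X]                                 # (4, 3, 10) direct lookup, no matmul
onehot = torch.nn.functional.one_hot(X, 27).float() @ C
assert torch.allclose(emb, onehot)

# .view reinterprets the same storage, so flattening the three embeddings
# into one row per example costs nothing:
flat = emb.view(emb.shape[0], -1)          # (4, 30)
```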
TRAINING STRATEGY AND LOSS
The initial approach computes the negative log-likelihood via a manual softmax and probability lookup, then switches to PyTorch's F.cross_entropy for efficiency and numerical stability. Gradients are zeroed, the loss is backpropagated, and parameters are updated under a learning-rate schedule. Training on a small subset initially demonstrates rapid gains (by overfitting it), but the real power emerges when training on the full dataset with minibatches, which smooth the gradient estimate and accelerate convergence while keeping the variance of individual updates manageable.
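A self-contained sketch of one such training loop. The random contexts and targets here are synthetic stand-ins for the real dataset, so the loss will not drop far; the structure of each step is what matters:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
X = torch.randint(0, 27, (1000, 3), generator=g)   # placeholder contexts
Y = torch.randint(0, 27, (1000,), generator=g)     # placeholder targets

C  = torch.randn((27, 10),  generator=g, requires_grad=True)
W1 = torch.randn((30, 200), generator=g, requires_grad=True)
b1 = torch.randn(200,       generator=g, requires_grad=True)
W2 = torch.randn((200, 27), generator=g, requires_grad=True)
b2 = torch.randn(27,        generator=g, requires_grad=True)
params = [C, W1, b1, W2, b2]

for step in range(100):
    ix = torch.randint(0, X.shape[0], (32,))       # minibatch of 32 examples
    emb = C[X[ix]]                                 # (32, 3, 10)
    h = torch.tanh(emb.view(32, -1) @ W1 + b1)     # (32, 200)
    logits = h @ W2 + b2                           # (32, 27)
    loss = F.cross_entropy(logits, Y[ix])          # fused log-softmax + NLL
    for p in params:
        p.grad = None                              # reset gradients
    loss.backward()
    lr = 0.1 if step < 50 else 0.01                # crude learning-rate decay
    for p in params:
        p.data += -lr * p.grad
```

F.cross_entropy is preferred over a hand-rolled softmax-and-index because it never materializes the full probability tensor and handles large logits without overflow.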
HYPERPARAMETERS, DEV/VALIDATION SPLIT, AND CAPACITY
A major theme is balancing model capacity against data size. The talk introduces a three-way data split, training, development (validation), and test, so hyperparameters can be tuned safely and generalization assessed on held-out data. Experiments sweep hidden-layer size, embedding dimensionality, and context length. Initial attempts with small (2-D) embeddings underfit; larger (10-D) embeddings and bigger hidden layers improve results. Despite the improvements, the text emphasizes the danger of overfitting as capacity grows, advocating careful validation and out-of-sample testing.
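An 80/10/10 split in the spirit described above might look like this sketch; the placeholder word list is an assumption standing in for the names dataset:

```python
import random

# Shuffle before splitting so all three sets have the same distribution.
words = [f"name{i}" for i in range(100)]   # placeholder for the real word list
random.seed(42)
random.shuffle(words)

n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))
train_words = words[:n1]                   # 80%: fit parameters
dev_words   = words[n1:n2]                 # 10%: tune hyperparameters
test_words  = words[n2:]                   # 10%: evaluate once, at the end
```

The dev set absorbs the hyperparameter search; touching the test set repeatedly would silently turn it into a second dev set.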
SAMPLING AND TEXT GENERATION
Sampling from the trained model illustrates practical usage: embed the current context, compute logits, apply softmax to obtain probabilities, and draw the next character index from that distribution. Repeating this process yields sequences that look far more word-like than the bigram model's output. The demonstration shows fresh, name-like strings, indicating learned structure. This section also covers decoding, i.e. turning sampled token indices back into human-readable text.
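A sampling loop in this spirit. The randomly initialized parameters below are stand-ins for a trained model, so the output is gibberish, but the embed/softmax/sample/slide mechanics are the same:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(7)
itos = {0: "."}                                   # index -> character decoding
itos.update({i + 1: ch for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")})

# Untrained stand-in parameters (assumed sizes: 10-D embeddings, 200 hidden).
C  = torch.randn((27, 10),  generator=g)
W1 = torch.randn((30, 200), generator=g); b1 = torch.randn(200, generator=g)
W2 = torch.randn((200, 27), generator=g); b2 = torch.randn(27,  generator=g)
block_size = 3

out, context = [], [0] * block_size               # start from '...'
while True:
    emb = C[torch.tensor([context])]              # (1, 3, 10)
    h = torch.tanh(emb.view(1, -1) @ W1 + b1)     # (1, 200)
    probs = F.softmax(h @ W2 + b2, dim=1)         # logits -> probabilities
    ix = torch.multinomial(probs, num_samples=1, generator=g).item()
    context = context[1:] + [ix]                  # slide the context window
    out.append(ix)
    if ix == 0 or len(out) > 50:                  # sampling '.' ends the word
        break
print("".join(itos[i] for i in out))
```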
COLLABORATION AND FUTURE WORK
A bonus note introduces Google Colab as an accessible way to run the full notebook without local installation. The Colab link enables interactive training, visualization, and sampling directly in the browser. The speaker invites readers to beat the reported validation loss (around 2.17 with tuned settings) by exploring more hyperparameters: embedding size, number of context characters, hidden units, and optimization schedules. The paper by Bengio et al. remains a guide for further improvements and ideas.
Common Questions
Why isn't one previous character of context enough? Using only one previous character yields poor, non-name-like predictions. Extending the context to multiple characters helps the model capture longer dependencies, enabling more coherent next-character predictions. This is explained in the early sections as the motivation for switching to a neural network approach.
Mentioned in this video
Colab notebook service used to run the shown code in a browser without local installation.
Influential paper (Bengio et al., 2003, "A Neural Probabilistic Language Model") introducing a multi-layer perceptron approach to predicting the next token in a sequence via learned, embedding-based representations.