Building makemore Part 2: MLP

Andrej Karpathy
Science & Technology · 4 min read · 76 min video
Sep 12, 2022

TL;DR

MLP char model with embeddings beats bigram; training tricks and sampling.

Key Insights

1. Context explosion in higher-order models motivates embedding-based generalization instead of enumerating all contexts.
2. Bengio et al. 2003 show that embedding vectors (for words in their work) allow a neural net to predict for unseen contexts by transferring related information.
3. Efficient embedding lookup in PyTorch (using a shared embedding matrix) enables scalable character-level models with varying context lengths.
4. Cross-entropy loss with softmax is preferred over manual probability calculations for numerical stability and performance.
5. Split data into training, development (validation), and test sets to evaluate hyperparameters and prevent overfitting.
6. Sampling after training demonstrates the model generating more name-like text, revealing progress beyond simple bigrams.

CONTEXT AND MOTIVATION

The lesson begins by contrasting the simplistic bigram model, which uses a single previous character to predict the next one, with the need for longer context. Relying on one-step history yields low-quality samples that barely resemble names. Expanding the context causes the context space to grow exponentially (27 possible characters to the power of N), making count-based tables impractical. A multi-layer perceptron (MLP) approach inspired by Bengio et al. 2003 addresses this by introducing embeddings that map tokens to dense vectors, enabling generalization and transfer of knowledge across similar contexts.
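The exponential blow-up is easy to see numerically. A quick sketch (the 27-symbol vocabulary is from the lecture; the loop bounds are illustrative):

```python
# Counting possible contexts for an n-gram-style model over a 27-character
# vocabulary (26 letters plus a '.' boundary token, as in makemore).
vocab_size = 27

for context_len in (1, 2, 3):
    n_contexts = vocab_size ** context_len
    print(f"context of {context_len} char(s): {n_contexts} possible contexts")
# With 3 characters of context there are already 27**3 = 19,683 rows to count,
# which is why the lecture moves from count tables to learned embeddings.
```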

EMBEDDINGS AND NETWORK ARCHITECTURE

In the Bengio-style model, every word (or character) is assigned a low-dimensional embedding (for words this was about 30 dimensions; for characters in our adaptation, a 27-character vocabulary is embedded). The embedding table C maps each token index to its vector. Three embedded inputs are concatenated and fed into a hidden layer of size H, followed by a linear output layer producing logits for the 27 possible next characters. A softmax converts logits to probabilities, and backprop updates the embeddings, hidden weights, and output weights.
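The architecture above can be sketched as a single forward pass. This is a minimal illustration, assuming a 27-token vocabulary, 2-dimensional embeddings, block size 3, and 100 hidden units (dimensions chosen for demonstration, not the lecture's final settings):

```python
import torch

g = torch.Generator().manual_seed(42)

vocab_size, emb_dim, block_size, hidden = 27, 2, 3, 100

C  = torch.randn((vocab_size, emb_dim), generator=g)           # embedding table
W1 = torch.randn((block_size * emb_dim, hidden), generator=g)  # hidden layer
b1 = torch.randn(hidden, generator=g)
W2 = torch.randn((hidden, vocab_size), generator=g)            # output layer
b2 = torch.randn(vocab_size, generator=g)

X = torch.randint(0, vocab_size, (4, block_size), generator=g)  # dummy batch of contexts

emb = C[X]                                                    # (4, 3, 2): one vector per context char
h = torch.tanh(emb.view(-1, block_size * emb_dim) @ W1 + b1)  # (4, 100) hidden activations
logits = h @ W2 + b2                                          # (4, 27): scores for the next character
probs = torch.softmax(logits, dim=1)                          # each row sums to 1
```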

DATA PREPARATION AND BLOCKS

The dataset is prepared with a fixed context length, the block size: the number of preceding characters used to predict the next one. Here a block size of three characters is used, and contexts are padded with dots at word boundaries. The code builds X (contexts) and Y (targets) by sliding a window across the text. In development, a tiny example (the name "emma") illustrates how a fixed context yields multiple input-output pairs. Ultimately, the full training set contains hundreds of thousands of examples, enabling substantial learning.
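The sliding-window construction on the tiny example looks roughly like this (the single-word list and the char-to-index mapping are stand-ins for the lecture's full name dataset):

```python
# Building (context, target) pairs with a sliding window over each word.
words = ["emma"]  # tiny example from the video; normally the full name list
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0                         # '.' is the padding / boundary token
itos = {i: ch for ch, i in stoi.items()}

block_size = 3  # how many preceding characters predict the next one
X, Y = [], []
for w in words:
    context = [0] * block_size        # start fully padded with '.'
    for ch in w + ".":
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        context = context[1:] + [ix]  # slide the window forward by one

for ctx, tgt in zip(X, Y):
    print("".join(itos[i] for i in ctx), "-->", itos[tgt])
# "emma" yields 5 pairs: '...'->e, '..e'->m, '.em'->m, 'emm'->a, 'mma'->.
```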

IMPLEMENTATION DETAILS IN PYTORCH

Key implementation points include the embedding lookup: embedding vectors are retrieved with C[x], where x is a tensor of indices. This replaces one-hot encoding and makes the model scalable. To feed three embeddings into a single linear layer, you reshape by viewing the 3-by-D embeddings as a flat 1-by-(3D) vector, avoiding an expensive concatenation. The hidden layer computes a nonlinear activation (tanh in the demonstration) before projecting to 27 logits. Cross-entropy loss with a softmax final layer provides stable, efficient training.
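The lookup-and-reshape trick can be checked directly; note that `.view` reinterprets the same storage rather than copying, which is why it is cheap (shapes here are illustrative):

```python
import torch

g = torch.Generator().manual_seed(0)
C = torch.randn((27, 2), generator=g)          # embedding table: 27 tokens, 2 dims
X = torch.randint(0, 27, (5, 3), generator=g)  # 5 contexts of 3 character indices

emb = C[X]              # fancy indexing replaces one_hot(X) @ C: shape (5, 3, 2)
flat = emb.view(5, 6)   # flattens the 3 embeddings per row into one vector
assert flat.data_ptr() == emb.data_ptr()  # same storage: a view, not a copy
```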

TRAINING STRATEGY AND LOSS

The initial approach computes the negative log-likelihood via a manual softmax, then switches to PyTorch's F.cross_entropy for efficiency and numerical stability. Each step resets the gradients, backpropagates, and updates the parameters with a learning-rate schedule. Overfitting a small batch initially demonstrates rapid gains, but the real power emerges when training on the full dataset with minibatches, which smooths the gradient and accelerates convergence while reducing variance in updates.
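A minimal training loop combining these pieces might look like the following sketch; the random tensors stand in for the real contexts and targets, and the dimensions and learning rate are illustrative:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
# Dummy data standing in for the name contexts (indices in [0, 27)).
Xtr = torch.randint(0, 27, (1000, 3), generator=g)
Ytr = torch.randint(0, 27, (1000,), generator=g)

C  = torch.randn((27, 2),   generator=g, requires_grad=True)
W1 = torch.randn((6, 100),  generator=g, requires_grad=True)
b1 = torch.randn(100,       generator=g, requires_grad=True)
W2 = torch.randn((100, 27), generator=g, requires_grad=True)
b2 = torch.randn(27,        generator=g, requires_grad=True)
params = [C, W1, b1, W2, b2]

lr = 0.1
for step in range(50):
    ix = torch.randint(0, Xtr.shape[0], (32,), generator=g)  # sample a minibatch
    emb = C[Xtr[ix]]
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Ytr[ix])  # fused log-softmax + NLL: stable
    for p in params:
        p.grad = None                        # reset gradients
    loss.backward()
    for p in params:
        p.data += -lr * p.grad               # SGD update
```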

HYPERPARAMETERS, DEV/VALIDATION SPLIT, AND CAPACITY

A major theme is balancing model capacity and data size. The talk introduces 3-way data splits: training, development (validation), and test, to tune hyperparameters safely and assess generalization. Experiments sweep hidden units, embedding dimensionality, and context length. Initial attempts with small embeddings (2D) show underfitting, improved by larger embeddings (10D) and bigger hidden layers. Despite improvements, the text emphasizes the danger of overfitting as capacity grows, advocating careful validation and out-of-sample testing.
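The split itself is a few lines. A sketch of the roughly 80/10/10 partition, using a placeholder word list (the real code shuffles the name dataset):

```python
import random

# Hypothetical 80/10/10 train/dev/test split of a shuffled word list.
words = [f"name{i}" for i in range(100)]  # placeholder for the real names
random.seed(42)
random.shuffle(words)

n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))
train, dev, test = words[:n1], words[n1:n2], words[n2:]
# Tune hyperparameters on dev; touch test only once, for the final number.
```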

SAMPLING AND TEXT GENERATION

Sampling from the trained model illustrates practical usage: embed the current context, compute logits, apply softmax to obtain probabilities, and draw the next character index from the distribution. Repeating this process yields sequences that appear more word-like than simple bi-grams. The demonstration shows fresh outputs such as name-like strings, indicating learned structure. This section also highlights the practical aspects of decoding and turning token indices back into human-readable text.
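The sampling loop described above can be sketched as follows; the parameters here are untrained stand-ins (with trained weights the same loop produces name-like strings), and the length cap is an added safety guard:

```python
import torch

g = torch.Generator().manual_seed(2147483647)
# Untrained stand-in parameters with the lecture's shapes (2D embeddings,
# block size 3, 100 hidden units); real sampling uses the trained weights.
C  = torch.randn((27, 2),   generator=g)
W1 = torch.randn((6, 100),  generator=g)
b1 = torch.randn(100,       generator=g)
W2 = torch.randn((100, 27), generator=g)
b2 = torch.randn(27,        generator=g)
itos = {i: ch for i, ch in enumerate("." + "abcdefghijklmnopqrstuvwxyz")}

context = [0, 0, 0]  # start from '...' padding
out = []
for _ in range(50):  # cap the length; '.' usually arrives much sooner
    emb = C[torch.tensor([context])]              # (1, 3, 2)
    h = torch.tanh(emb.view(1, -1) @ W1 + b1)
    logits = h @ W2 + b2
    probs = torch.softmax(logits, dim=1)
    ix = torch.multinomial(probs, num_samples=1, generator=g).item()
    context = context[1:] + [ix]                  # slide the window
    if ix == 0:                                   # '.' ends the name
        break
    out.append(itos[ix])
print("".join(out))  # decode indices back into human-readable text
```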

COLLABORATION AND FUTURE WORK

A bonus note introduces Google Colab as an accessible way to run the full notebook without local installation. The Colab link enables interactive training, visualization, and sampling directly in the browser. The speaker invites readers to beat the reported validation loss (around 2.17 with tuned settings) by exploring more hyperparameters: embedding size, number of context characters, hidden units, and optimization schedules. The paper by Bengio et al. remains a guide for further improvements and ideas.

PRACTICAL CHEAT SHEET: BUILDING A CHARACTER-LEVEL MLP LANGUAGE MODEL

Practical takeaways from this episode:

Do This

Start with a small context window (e.g., 3 characters) and observe performance.
Use embedding lookup tables to map discrete tokens to continuous vectors.
Train with mini-batches and use cross-entropy loss for the next-token task.
Validate with train/dev/test splits to monitor overfitting.
Experiment with embedding dimensionality and hidden layer size to address bottlenecks.

Avoid This

Don’t rely on a single large context window without sufficient data; it explodes the context space.
Don’t train on the test set; reserve it for final evaluation only.

Common Questions

Q: Why is a single previous character (the bigram setup) not enough?
A: Using only one previous character yields poor, non-name-like predictions. Extending the context to multiple characters lets the model capture longer dependencies, enabling more coherent next-character predictions. This is explained in the early sections as the motivation for switching to a neural network approach.
