Building makemore Part 2: MLP

Andrej Karpathy
Science & Technology · 4 min read · 76 min video
Sep 12, 2022

TL;DR

MLP char model with embeddings beats bigram; training tricks and sampling.

Key Insights

1. Context explosion in higher-order models motivates embedding-based generalization instead of enumerating all contexts.
2. Bengio et al. 2003 show that embedding vectors (for words in their work) allow a neural net to predict for unseen contexts by transferring related information.
3. Efficient embedding lookup in PyTorch (using a shared embedding matrix) enables scalable character-level models with varying context lengths.
4. Cross-entropy loss with softmax is preferred over manual probability calculations for numerical stability and performance.
5. Split data into training, development (validation), and test sets to evaluate hyperparameters and prevent overfitting.
6. Sampling after training demonstrates the model generating more name-like text, revealing progress beyond simple bigrams.

CONTEXT AND MOTIVATION

The lesson begins by contrasting the simplistic bigram model, which uses a single previous character to predict the next one, with the need for longer context. Relying on one-step history yields low-quality samples that barely resemble names. Expanding the context causes the context space to grow exponentially (27 possible characters to the power of N), making count-based tables impractical. A multi-layer perceptron (MLP) approach inspired by Bengio et al. 2003 addresses this by introducing embeddings that map tokens to dense vectors, enabling generalization and transfer of knowledge across similar contexts.
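The exponential blow-up is easy to see numerically. A quick sketch (the 27-symbol vocabulary is from the lecture; the loop bounds are illustrative):

```python
# Counting possible contexts for an n-gram-style model over a 27-character
# vocabulary (26 letters plus a '.' boundary token, as in makemore).
vocab_size = 27

for context_len in (1, 2, 3):
    n_contexts = vocab_size ** context_len
    print(f"context of {context_len} char(s): {n_contexts} possible contexts")
# With 3 characters of context there are already 27**3 = 19,683 rows to count,
# which is why the lecture moves from count tables to learned embeddings.
```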

EMBEDDINGS AND NETWORK ARCHITECTURE

In the Bengio-style model, every word (or character) is assigned a low-dimensional embedding (for words this was about 30 dimensions; for characters in our adaptation, a 27-character vocabulary is embedded). The embedding table C maps each token index to its vector. Three embedded inputs are concatenated and fed into a hidden layer of size H, followed by a linear output layer producing logits for the 27 possible next characters. A softmax converts logits to probabilities, and backprop updates the embeddings, hidden weights, and output weights.
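The architecture above can be sketched as a single forward pass. This is a minimal illustration, assuming a 27-token vocabulary, 2-dimensional embeddings, block size 3, and 100 hidden units (dimensions chosen for demonstration, not the lecture's final settings):

```python
import torch

g = torch.Generator().manual_seed(42)

vocab_size, emb_dim, block_size, hidden = 27, 2, 3, 100

C  = torch.randn((vocab_size, emb_dim), generator=g)           # embedding table
W1 = torch.randn((block_size * emb_dim, hidden), generator=g)  # hidden layer
b1 = torch.randn(hidden, generator=g)
W2 = torch.randn((hidden, vocab_size), generator=g)            # output layer
b2 = torch.randn(vocab_size, generator=g)

X = torch.randint(0, vocab_size, (4, block_size), generator=g)  # dummy batch of contexts

emb = C[X]                                                    # (4, 3, 2): one vector per context char
h = torch.tanh(emb.view(-1, block_size * emb_dim) @ W1 + b1)  # (4, 100) hidden activations
logits = h @ W2 + b2                                          # (4, 27): scores for the next character
probs = torch.softmax(logits, dim=1)                          # each row sums to 1
```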

DATA PREPARATION AND BLOCKS

The dataset is prepared with a fixed context length, the block size: the number of preceding characters used to predict the next one. Here a block size of three characters is used, and contexts are padded with dots at word boundaries. The code builds X (contexts) and Y (targets) by sliding a window across the text. In development, a tiny example (the name "emma") illustrates how a fixed context yields multiple input-output pairs. Ultimately, the full training set contains hundreds of thousands of examples, enabling substantial learning.
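The sliding-window construction on the tiny example looks roughly like this (the single-word list and the char-to-index mapping are stand-ins for the lecture's full name dataset):

```python
# Building (context, target) pairs with a sliding window over each word.
words = ["emma"]  # tiny example from the video; normally the full name list
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0                         # '.' is the padding / boundary token
itos = {i: ch for ch, i in stoi.items()}

block_size = 3  # how many preceding characters predict the next one
X, Y = [], []
for w in words:
    context = [0] * block_size        # start fully padded with '.'
    for ch in w + ".":
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        context = context[1:] + [ix]  # slide the window forward by one

for ctx, tgt in zip(X, Y):
    print("".join(itos[i] for i in ctx), "-->", itos[tgt])
# "emma" yields 5 pairs: '...'->e, '..e'->m, '.em'->m, 'emm'->a, 'mma'->.
```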

IMPLEMENTATION DETAILS IN PYTORCH

Key implementation points include the embedding lookup: embedding vectors are retrieved with C[x], where x is a tensor of indices. This replaces one-hot encoding and makes the model scalable. To feed three embeddings into a single linear layer, you reshape by viewing the 3-by-D embeddings as a flat 1-by-(3D) vector, avoiding an expensive concatenation. The hidden layer computes a nonlinear activation (tanh in the demonstration) before projecting to 27 logits. Cross-entropy loss with a softmax final layer provides stable, efficient training.
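The lookup-and-reshape trick can be checked directly; note that `.view` reinterprets the same storage rather than copying, which is why it is cheap (shapes here are illustrative):

```python
import torch

g = torch.Generator().manual_seed(0)
C = torch.randn((27, 2), generator=g)          # embedding table: 27 tokens, 2 dims
X = torch.randint(0, 27, (5, 3), generator=g)  # 5 contexts of 3 character indices

emb = C[X]              # fancy indexing replaces one_hot(X) @ C: shape (5, 3, 2)
flat = emb.view(5, 6)   # flattens the 3 embeddings per row into one vector
assert flat.data_ptr() == emb.data_ptr()  # same storage: a view, not a copy
```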

TRAINING STRATEGY AND LOSS

The initial approach computes the negative log-likelihood via a manual softmax, then switches to PyTorch's F.cross_entropy for efficiency and numerical stability. Each step resets the gradients, backpropagates, and updates the parameters with a learning-rate schedule. Overfitting a small batch initially demonstrates rapid gains, but the real power emerges when training on the full dataset with minibatches, which smooths the gradient and accelerates convergence while reducing variance in updates.
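A minimal training loop combining these pieces might look like the following sketch; the random tensors stand in for the real contexts and targets, and the dimensions and learning rate are illustrative:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
# Dummy data standing in for the name contexts (indices in [0, 27)).
Xtr = torch.randint(0, 27, (1000, 3), generator=g)
Ytr = torch.randint(0, 27, (1000,), generator=g)

C  = torch.randn((27, 2),   generator=g, requires_grad=True)
W1 = torch.randn((6, 100),  generator=g, requires_grad=True)
b1 = torch.randn(100,       generator=g, requires_grad=True)
W2 = torch.randn((100, 27), generator=g, requires_grad=True)
b2 = torch.randn(27,        generator=g, requires_grad=True)
params = [C, W1, b1, W2, b2]

lr = 0.1
for step in range(50):
    ix = torch.randint(0, Xtr.shape[0], (32,), generator=g)  # sample a minibatch
    emb = C[Xtr[ix]]
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Ytr[ix])  # fused log-softmax + NLL: stable
    for p in params:
        p.grad = None                        # reset gradients
    loss.backward()
    for p in params:
        p.data += -lr * p.grad               # SGD update
```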

HYPERPARAMETERS, DEV/VALIDATION SPLIT, AND CAPACITY

A major theme is balancing model capacity and data size. The talk introduces 3-way data splits: training, development (validation), and test, to tune hyperparameters safely and assess generalization. Experiments sweep hidden units, embedding dimensionality, and context length. Initial attempts with small embeddings (2D) show underfitting, improved by larger embeddings (10D) and bigger hidden layers. Despite improvements, the text emphasizes the danger of overfitting as capacity grows, advocating careful validation and out-of-sample testing.
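The split itself is a few lines. A sketch of the roughly 80/10/10 partition, using a placeholder word list (the real code shuffles the name dataset):

```python
import random

# Hypothetical 80/10/10 train/dev/test split of a shuffled word list.
words = [f"name{i}" for i in range(100)]  # placeholder for the real names
random.seed(42)
random.shuffle(words)

n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))
train, dev, test = words[:n1], words[n1:n2], words[n2:]
# Tune hyperparameters on dev; touch test only once, for the final number.
```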

SAMPLING AND TEXT GENERATION

Sampling from the trained model illustrates practical usage: embed the current context, compute logits, apply softmax to obtain probabilities, and draw the next character index from the distribution. Repeating this process yields sequences that appear more word-like than simple bi-grams. The demonstration shows fresh outputs such as name-like strings, indicating learned structure. This section also highlights the practical aspects of decoding and turning token indices back into human-readable text.
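The sampling loop described above can be sketched as follows; the parameters here are untrained stand-ins (with trained weights the same loop produces name-like strings), and the length cap is an added safety guard:

```python
import torch

g = torch.Generator().manual_seed(2147483647)
# Untrained stand-in parameters with the lecture's shapes (2D embeddings,
# block size 3, 100 hidden units); real sampling uses the trained weights.
C  = torch.randn((27, 2),   generator=g)
W1 = torch.randn((6, 100),  generator=g)
b1 = torch.randn(100,       generator=g)
W2 = torch.randn((100, 27), generator=g)
b2 = torch.randn(27,        generator=g)
itos = {i: ch for i, ch in enumerate("." + "abcdefghijklmnopqrstuvwxyz")}

context = [0, 0, 0]  # start from '...' padding
out = []
for _ in range(50):  # cap the length; '.' usually arrives much sooner
    emb = C[torch.tensor([context])]              # (1, 3, 2)
    h = torch.tanh(emb.view(1, -1) @ W1 + b1)
    logits = h @ W2 + b2
    probs = torch.softmax(logits, dim=1)
    ix = torch.multinomial(probs, num_samples=1, generator=g).item()
    context = context[1:] + [ix]                  # slide the window
    if ix == 0:                                   # '.' ends the name
        break
    out.append(itos[ix])
print("".join(out))  # decode indices back into human-readable text
```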

COLLABORATION AND FUTURE WORK

A bonus note introduces Google Colab as an accessible way to run the full notebook without local installation. The Colab link enables interactive training, visualization, and sampling directly in the browser. The speaker invites readers to beat the reported validation loss (around 2.17 with tuned settings) by exploring more hyperparameters: embedding size, number of context characters, hidden units, and optimization schedules. The paper by Bengio et al. remains a guide for further improvements and ideas.

PRACTICAL CHEAT SHEET: BUILDING A CHARACTER-LEVEL MLP LANGUAGE MODEL

Practical takeaways from this episode:

Do This

Start with a small context window (e.g., 3 characters) and observe performance.
Use embedding lookup tables to map discrete tokens to continuous vectors.
Train with mini-batches and use cross-entropy loss for the next-token task.
Validate with train/dev/test splits to monitor overfitting.
Experiment with embedding dimensionality and hidden layer size to address bottlenecks.

Avoid This

Don’t rely on a single large context window without sufficient data; it explodes the context space.
Don’t train on the test set; reserve it for final evaluation only.

Common Questions

Q: Why is a single previous character (the bigram setup) not enough?
A: Using only one previous character yields poor, non-name-like predictions. Extending the context to multiple characters lets the model capture longer dependencies, enabling more coherent next-character predictions. This is explained in the early sections as the motivation for switching to a neural network approach.
