The spelled-out intro to language modeling: building makemore

Andrej Karpathy
Science & Technology · 5 min read · 118 min video
Sep 7, 2022 · 1,099,670 views


TL;DR

From counting bigrams to a neural net: building a character-level language model for names.

Key Insights

1. A character-level model can learn to generate name-like strings by modeling sequences of characters, starting from a start token and ending with an end token.

2. A simple bigram (two-character) model can be built by counting occurrences of character pairs and normalizing to probabilities, enabling sampling of new names.

3. PyTorch tensors and broadcasting enable efficient representation and normalization of bigram counts, plus deterministic sampling with a seeded generator.

4. Smoothing (adding fake counts) prevents zero probabilities and stabilizes generation; regularization can play a similar stabilizing role in neural nets.

5. A neural-network extension replaces explicit counts with learned logits: the logits are interpreted as log-counts, exponentiating them yields counts, and softmax normalization produces probabilities.

6. Gradient-based training with negative log-likelihood mirrors likelihood-based counting, but scales to longer contexts and more complex architectures (e.g., transformers).

INTRODUCTION AND GOAL: MAKEMORE FOR NAMES

The project begins with a blank Jupyter notebook and a dataset named names.txt, which contains thousands of names (around 32,000 in the example). The goal is to build a language model that operates at the level of characters, treating each line as a sequence of characters that the model must learn to predict. The intro emphasizes starting small and spelling everything out: first implement a character-level model, then extend to word-level and beyond. The intuition is that the model should learn which characters tend to follow which, thereby generating new, plausible name-like sequences.

DATASET AND CHARACTER-LEVEL MODELING

At the core is a dataset of names where each word is treated as a sequence of characters. We introduce a special dot token that marks both the start and the end of a word, so the model can learn which character typically begins a name and which character ends it. The 26 lowercase letters plus the dot form a 27-token vocabulary, and the modeling is designed around predicting the next character given the previous one. This sets up the notion of a character-level language model, which can eventually scale to longer contexts and even multi-modal outputs.
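A minimal sketch of this vocabulary setup (a tiny inline word list stands in for names.txt, so the alphabet here is only the letters those words contain):

```python
# Minimal sketch of the vocabulary setup; a tiny inline list stands in
# for the ~32,000 names in names.txt.
words = ["emma", "olivia", "ava"]

# Sorted unique characters; with the full file this is the 26 letters.
chars = sorted(set("".join(words)))

# '.' takes index 0 and serves as both the start and the end token.
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}
```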

INITIAL BIGRAM MODEL AND COUNTING

The first concrete model is a bigram model that looks at only two consecutive characters. By iterating through every word and over every pair of adjacent characters (including the start and end tokens), we count how often each bigram occurs. The counts are stored in a two-dimensional matrix (a 27-by-27 grid once the dot token is included), where the row is the first character and the column is the second. This simple counting approach captures local structure and provides a baseline for sampling and evaluation.
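The counting loop can be sketched as follows (a toy word list stands in for names.txt, and the mapping is built inline so the block is self-contained):

```python
import torch

words = ["emma", "olivia", "ava"]  # stand-in for the full names.txt
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0

# One row/column per token; 27x27 with the real vocabulary.
n = len(stoi)
N = torch.zeros((n, n), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]        # wrap each word in dot tokens
    for ch1, ch2 in zip(chs, chs[1:]):   # every adjacent character pair
        N[stoi[ch1], stoi[ch2]] += 1
```

Each word of length L contributes L + 1 bigrams once the start and end dots are included.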

BUILDING THE PROBABILITY MATRIX AND SAMPLING

The raw counts are converted into probabilities by normalizing each row so that the probabilities of all possible next characters sum to one. A mapping between characters and indices (stoi and itos) is created to enable indexing into the matrix. Sampling begins with the start token (dot) and, at each step, selects the next character according to the corresponding row’s distribution, stopping when the dot is drawn again. Because the model sees only one character of context, the generated strings are name-like but often garbled. Visualization tools help inspect the distribution, revealing which transitions are common or rare.
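Normalization and sampling might look like this (a toy 3x3 count table keeps the block self-contained; row/column 0 plays the role of the dot token):

```python
import torch

# Toy 3x3 count table; row/column 0 plays the role of the '.' token.
N = torch.tensor([[0., 2., 2.],
                  [1., 0., 3.],
                  [4., 0., 0.]])
P = N / N.sum(dim=1, keepdim=True)  # normalize each row to a distribution

g = torch.Generator().manual_seed(2147483647)  # seeded for reproducibility
ix = 0                                          # start at the '.' row
out = []
while True:
    ix = torch.multinomial(P[ix], num_samples=1,
                           replacement=True, generator=g).item()
    if ix == 0:                                 # drew the end token
        break
    out.append(ix)
```

The seeded generator makes the sampled sequence reproducible across runs.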

VISUALIZATION AND STRUCTURE OF THE COUNTS

To understand the model’s structure, the 27-by-27 counts matrix is visualized. The visualization highlights which first letters are common starters and which transitions follow, while also illustrating the constraints imposed by the start and end tokens. The process also reveals natural zeros (e.g., the end token cannot begin a word) and how the presence of the start and end tokens shapes the learned distribution. This step makes abstract statistics tangible and informs subsequent modeling choices.

PROBABILITIES, LOG-LIKELIHOOD, AND LOSS

Beyond sampling, we evaluate the model by inspecting the probabilities assigned to actual bigrams from the training data. The concept of likelihood (the product of probabilities for all observed bigrams) becomes unwieldy, so the log-likelihood (sum of log probabilities) is used, and the negative log-likelihood serves as a loss to minimize. A rough training loss around 2.4 (average negative log-likelihood) indicates the model is learning meaningful structure. Smoothing is introduced to avoid zeros, which would otherwise produce infinite loss.
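The loss computation can be sketched with a toy probability table (P, xs, and ys here are illustrative stand-ins for the real 27x27 matrix and the dataset's bigrams):

```python
import torch

# Toy row-normalized probability table and observed bigrams.
P = torch.tensor([[0.1, 0.9],
                  [0.5, 0.5]])
xs = torch.tensor([0, 1, 1])  # previous-character indices
ys = torch.tensor([1, 0, 1])  # observed next-character indices

# The likelihood is the product of P[x, y] over all bigrams; working in
# log space turns the product into a sum and avoids underflow.
log_likelihood = torch.log(P[xs, ys]).sum()
nll = -log_likelihood / xs.numel()  # average negative log-likelihood
```

A lower average NLL means the model assigns higher probability to the observed next characters.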

SMOOTHING AND ITS EQUIVALENCE TO REGULARIZATION

Smoothing adds a small count to every possible bigram, ensuring no probability is zero. This prevents pathological zero-probability transitions and stabilizes generation. The discussion parallels regularization in neural nets: adding a small penalty (like weight decay) nudges the model toward more conservative, smoother predictions. Smoothing is a practical fix for small data or sparse tables, while regularization in neural nets offers a more general mechanism to prevent overfitting and encourage generalizable behavior.
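Add-one smoothing is a one-line change to the normalization (toy 2x2 counts for illustration):

```python
import torch

# Toy counts whose zeros would otherwise yield infinite loss.
N = torch.tensor([[0., 5.],
                  [3., 0.]])

# Add a fake count of 1 to every cell before normalizing, so every
# transition gets a small nonzero probability.
P = (N + 1) / (N + 1).sum(dim=1, keepdim=True)
```

Larger fake counts pull every row closer to the uniform distribution, which is the counting-model analogue of stronger regularization.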

FROM COUNTS TO NEURAL NETWORKS: ONE-HOT ENCODING AND LOGITS

The model transitions from explicit counts to a neural-network framework. The input character is one-hot encoded, producing a 27-dimensional vector that feeds into a simple neural net: a single linear layer mapping 27 inputs to 27 outputs (logits). These logits are interpreted as log-counts; exponentiation yields counts, and normalization (softmax) converts them into probabilities for the next character. This reframing keeps the same probabilistic interpretation while enabling gradient-based optimization and the ability to scale to longer contexts.
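A sketch of this forward pass (W is randomly initialized here; 27 matches the vocabulary size, and the input indices are arbitrary examples):

```python
import torch
import torch.nn.functional as F

vocab_size = 27
xs = torch.tensor([0, 5, 13])  # example previous-character indices

# One-hot encode; cast to float so it can enter a matrix multiply.
xenc = F.one_hot(xs, num_classes=vocab_size).float()

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((vocab_size, vocab_size), generator=g)

logits = xenc @ W                     # interpreted as log-counts
counts = logits.exp()                 # analogue of the count matrix N
probs = counts / counts.sum(dim=1, keepdim=True)  # softmax
```

The last two lines are exactly a softmax, written out to mirror the counting model's "counts, then normalize" structure.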

TRAINING THE NEURAL NET: LOSS, BACKPROPAGATION, AND OPTIMIZATION

Training proceeds with a forward pass producing a 5-by-27 probability matrix for the five examples from a word like emma. The targets are the actual next characters, and the loss is the mean of the negative log-likelihood across examples. PyTorch autograd is employed: requires_grad is set, loss.backward() computes gradients, and a simple gradient-descent step updates the 27-by-27 weight matrix. The demonstration shows loss decreasing over iterations, aligning the neural net’s probabilities with the training data’s next-character distribution.
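The training loop can be sketched on the five bigrams of "emma" (a toy 4-token vocabulary is assumed here, with . = 0, a = 1, e = 2, m = 3; the learning rate and iteration count are illustrative):

```python
import torch
import torch.nn.functional as F

# Bigrams of ".emma.": (.,e) (e,m) (m,m) (m,a) (a,.)
# Toy indices: . = 0, a = 1, e = 2, m = 3
xs = torch.tensor([0, 2, 3, 3, 1])
ys = torch.tensor([2, 3, 3, 1, 0])

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((4, 4), generator=g, requires_grad=True)

losses = []
for _ in range(200):
    # forward pass: one-hot -> logits -> softmax -> average NLL
    xenc = F.one_hot(xs, num_classes=4).float()
    logits = xenc @ W
    probs = logits.exp() / logits.exp().sum(dim=1, keepdim=True)
    loss = -probs[torch.arange(5), ys].log().mean()
    losses.append(loss.item())

    # backward pass and gradient-descent update
    W.grad = None
    loss.backward()
    W.data += -1.0 * W.grad
```

Note the loss cannot reach zero: the character m is followed by both m and a in the data, so some probability mass must be split.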

EVALUATION: COMPARING COUNTING AND GRADIENT-BASED LEARNING

Two learning paths—count-based normalization and gradient-based optimization—converge to essentially the same model. The count-based approach yields a direct estimate of bigram probabilities, while the neural-net approach learns logits that, after exponentiation and normalization, reproduce similar probabilities. The takeaway is that both methods optimize the same objective: predicting the next character. The neural-net route offers greater scalability to longer contexts and more sophisticated architectures, paving the way toward transformers and more powerful language models.
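One way to see why the two paths converge: multiplying a one-hot vector by W simply selects a row of W, so each row of W ends up playing the role of one row of log-counts in the table-based model. A quick check of that equivalence:

```python
import torch
import torch.nn.functional as F

W = torch.randn(27, 27)
xs = torch.tensor([0, 5, 13])
xenc = F.one_hot(xs, num_classes=27).float()

# One-hot matmul is just row selection: xenc @ W picks rows xs of W.
selected = xenc @ W
```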

FUTURE DIRECTIONS: SCALING CONTEXTS AND TRANSFORMERS

The concluding sections point toward extending beyond single-character context to longer histories (e.g., multiple preceding characters) and toward deeper neural networks. The framework remains the same: logits feed into a softmax to produce probabilities, loss guides optimization, and gradients propagate through the network. The path forward envisions richer architectures, regularization strategies, and ultimately transformer-based models capable of handling extensive context, more complex dependencies, and multi-modal outputs like image-text systems, while preserving the core probabilistic training mindset.

Bigram character-level LM — quick cheat sheet

Practical takeaways from this episode

Do This

Do prepend a single start token (.) and append an end token when extracting bigrams from each line.
Do store bigram counts in an N×N matrix (N = vocab size + specials) and normalize rows to get conditional probabilities.
Do use torch.multinomial with a fixed torch.Generator(seed) for deterministic sampling during demos.
Do compute loss as the average negative log-likelihood (NLL) over the dataset for training and evaluation.
Do use in-place operations and correct keepdim semantics when normalizing tensors to avoid silent broadcasting bugs.

Avoid This

Don't forget to convert integer label tensors to float where required (e.g., one-hot → float) before matrix ops.
Don't rely on implicit broadcasting shapes — use keepdim=True when summing rows to ensure division broadcasts the correct way.
Don't sample with replacement=False when you intend to sample multiple times from a distribution; set replacement=True.
Don't expect a simple bigram model to produce fluent long strings — it ignores longer context.
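The keepdim pitfall above can be demonstrated directly (toy 3x3 matrix):

```python
import torch

N = torch.arange(1., 10.).reshape(3, 3)

# Correct: the sum keeps shape (3, 1), so division normalizes each row.
P_good = N / N.sum(dim=1, keepdim=True)

# Buggy: the sum has shape (3,), which broadcasts as (1, 3) and silently
# divides each COLUMN by a different row-sum instead.
P_bad = N / N.sum(dim=1)
```

Both versions run without error, which is exactly why the bug is easy to miss.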

Common Questions

MakeMore is a character-level language model repo that generates more items like those in the input dataset; the example uses names.txt (~32K names) where each line is a training example. (see 29s)

