Key Moments
Building makemore Part 5: Building a WaveNet
WaveNet-inspired hierarchical fusion for character prediction with PyTorch-like modules.
Key Insights
Increasing context length from 3 to 8 improves validation loss (2.10 -> ~2.02) and sample quality.
The project progresses from a flat MLP to modular building blocks (embedding, flatten, sequential) inspired by PyTorch.
A hierarchical fusion approach is introduced, progressively combining context (2 chars -> 4, 8, etc.) to form deeper representations.
BatchNorm behavior across multiple batch dimensions is subtle and bug-prone; fixing it improves stability and performance.
Using a Sequential-like container simplifies forward passes and keeps parameter management clean and scalable.
Larger embeddings and deeper/hierarchical networks yield further gains (e.g., ~1.99 validation) but require longer training and tuning.
EMBRACING A LONGER CONTEXT: FROM 3-CHAR TO 8-CHAR BLOCKS
The video describes a shift from predicting the next character using a tiny window of three characters to a broader, eight-character block. This change amplifies the input dimensionality and alters the first linear layer, increasing the parameter count by roughly 10k. The larger context allows the model to capture longer-range dependencies and paves the way for a WaveNet-like architecture that processes context in a more nuanced, staged fashion rather than in a single oversized leap.
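The dataset-construction change can be sketched as follows, using names (`block_size`, `build_dataset`, `stoi`) that follow the lecture's conventions; this is a minimal sketch, not the exact code from the video:

```python
import torch

# Context length: how many characters we use to predict the next one.
# The video bumps this from 3 to 8; everything else stays the same.
block_size = 8

def build_dataset(words, stoi):
    """Turn a list of words into (context, target) training pairs."""
    X, Y = [], []
    for w in words:
        context = [0] * block_size          # pad with the '.' token (index 0)
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]    # slide the window one character
    return torch.tensor(X), torch.tensor(Y)
```

Because only `block_size` changes, the same pipeline produces wider context rows; the first linear layer must then grow to accept the larger flattened input.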
INTRODUCING WAVENET-STYLE HIERARCHICAL FUSION
Inspired by WaveNet, the architecture begins fusing information progressively: two characters form a small unit, then two bigrams are fused into four-gram representations, and so on up the hierarchy. The goal is to create a deep, tree-like structure where information is fused gradually rather than crushed in a single hidden layer. This hierarchical fusion makes the network deeper in a meaningful way, aiding the model in utilizing context while maintaining computational efficiency via a dilated/convolutional mindset.
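The shape arithmetic behind this fusion can be illustrated on its own (the batch size of 4 and 10-dimensional embeddings here are assumptions for illustration):

```python
import torch

# Track how a batch of 8-character contexts is fused pairwise at each level.
x = torch.randn(4, 8, 10)            # (batch, characters, embedding)
for level in range(3):               # 8 -> 4 -> 2 -> 1 groups
    B, T, C = x.shape
    x = x.view(B, T // 2, C * 2)     # fuse each consecutive pair of vectors
    print(x.shape)                   # (4,4,20) -> (4,2,40) -> (4,1,80)
# In the real model, a Linear + BatchNorm + Tanh block sits between levels,
# so information is mixed (not just concatenated) at every fusion step.
```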
MODULAR LAYERS: EMBEDDING, FLATTENING, AND A PYTORCH-LINEAGE
A shift to modular blocks is described: an embedding module replaces a plain weight lookup, and a flatten module replaces ad-hoc reshaping. These modules imitate PyTorch's Embedding and Flatten concepts, enabling cleaner, reusable code. The embedding maps indices to vectors, and the flatten operation reorganizes the resulting 3D tensor into a form suitable for the next linear layer. This move toward PyTorch-like primitives makes the code easier to reason about and reuse in larger architectures.
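A minimal sketch of the two modules, in the spirit of the lecture's hand-rolled layers (details such as initialization are simplified here):

```python
import torch

class Embedding:
    """Lookup table mapping integer indices to vectors (mirrors nn.Embedding)."""
    def __init__(self, num_embeddings, embedding_dim):
        self.weight = torch.randn(num_embeddings, embedding_dim)
    def __call__(self, IX):
        self.out = self.weight[IX]        # fancy indexing does the lookup
        return self.out
    def parameters(self):
        return [self.weight]

class Flatten:
    """Collapse everything after the batch dimension (mirrors nn.Flatten)."""
    def __call__(self, x):
        self.out = x.view(x.shape[0], -1)
        return self.out
    def parameters(self):
        return []
```

With an 8-character context and 10-dimensional embeddings, `Embedding` turns a (4, 8) index tensor into (4, 8, 10), and `Flatten` reshapes that into (4, 80) for a plain linear layer.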
BUILDING A SEQUENTIAL CONTAINER TO SIMPLIFY FORWARD PASSES
To avoid scattered, ad-hoc forward passes, a Sequential-like container is introduced. This container stores a list of layers and forwards input through them in order, exposing a clean interface for parameter management and training. The model then becomes a single module that can be invoked as a whole. This mirrors PyTorch’s nn.Sequential, highlighting the value of container abstractions to reduce boilerplate and improve readability while preserving flexibility.
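The container itself is only a few lines; this sketch assumes each layer exposes `__call__` and `parameters()`, as in the modules above:

```python
class Sequential:
    """Container that calls its layers in order (mirrors nn.Sequential)."""
    def __init__(self, layers):
        self.layers = layers
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        self.out = x
        return self.out
    def parameters(self):
        # Collect parameters from all layers into one flat list.
        return [p for layer in self.layers for p in layer.parameters()]
```

Training code then reduces to `logits = model(Xb)` plus a single `model.parameters()` call for the optimizer loop.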
RETHINKING FLATTENING: FROM FLAT TO GROUPED CONCAT
A critical design shift is to stop flattening all context at once. Instead, the code now groups consecutive elements (e.g., pairs) and flattens only those groups, producing shapes such as 4x4x20 rather than 4x80. This enables the first linear layer to operate on smaller, grouped inputs (e.g., eight characters → four groups of two), laying the groundwork for hierarchical, parallel processing of multiple n-grams within each example.
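The grouped variant is a small generalization of `Flatten`, close in spirit to the lecture's `FlattenConsecutive`:

```python
import torch

class FlattenConsecutive:
    """Flatten groups of n consecutive vectors instead of the whole context."""
    def __init__(self, n):
        self.n = n
    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:              # drop a spurious time dimension of 1
            x = x.squeeze(1)
        self.out = x
        return self.out
    def parameters(self):
        return []
```

With `n=2`, a (4, 8, 10) embedding tensor becomes (4, 4, 20): four examples, four bigram groups, twenty features each, exactly the shape the hierarchical first layer expects.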
MATRIX MULTIPLICATION ACROSS HIGHER DIMENSIONS
The discussion emphasizes that PyTorch’s matmul supports higher-dimensional inputs with broadcasting semantics. By reshaping to 4x4x20, the model can multiply along the last dimension (20) while treating the preceding dimensions as batch dimensions. This allows efficient, parallel processing of multiple two-character groups within each example, enabling the neural net to scale context handling without exploding the number of parameters unnecessarily.
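A small demonstration of this broadcasting behavior (the hidden size of 200 is an assumption for illustration):

```python
import torch

# PyTorch matrix multiply broadcasts over leading dimensions: only the last
# dimension of the input must match the first dimension of the weight matrix.
x = torch.randn(4, 4, 20)     # (batch, groups, features): 4 bigram groups each
W = torch.randn(20, 200)      # weights of the first hidden layer
b = torch.randn(200)
h = x @ W + b                 # applied in parallel to every (batch, group) pair
print(h.shape)                # torch.Size([4, 4, 200])
```

The same 20-by-200 weight matrix is reused across all groups, which is why grouping the context does not multiply the parameter count.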
BUG FIXES AND DIFFICULTIES: BATCHNORM STATES AND MULTI-DIMENSIONAL INPUTS
A subtle but important issue arises with BatchNorm when inputs carry more than one batch-like dimension. With a 32x4x68 activation, reducing only over dimension 0 maintained four separate sets of running statistics (one per position) instead of a single set of per-channel statistics over the 68 channels. The fix is to reduce over dimensions 0 and 1 together, treating the (batch, time) pair jointly as the batch dimension, so the running statistics align with the intended channel-wise normalization. This stabilizes training as context widens and the network deepens.
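The corrected normalization can be sketched as below, following the lecture's hand-rolled BatchNorm (the key line is the `dim` selection for 3D inputs):

```python
import torch

class BatchNorm1d:
    """BatchNorm that treats every dimension except the last as 'batch'."""
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps, self.momentum, self.training = eps, momentum, True
        self.gamma, self.beta = torch.ones(dim), torch.zeros(dim)
        self.running_mean, self.running_var = torch.zeros(dim), torch.ones(dim)
    def __call__(self, x):
        if self.training:
            # The bug fix: for a 3D input, reduce over dims (0, 1), not just 0,
            # so statistics are per channel rather than per position.
            dim = 0 if x.ndim == 2 else (0, 1)
            xmean = x.mean(dim, keepdim=True)
            xvar = x.var(dim, keepdim=True)
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        else:
            xmean, xvar = self.running_mean, self.running_var
        self.out = self.gamma * (x - xmean) / torch.sqrt(xvar + self.eps) + self.beta
        return self.out
    def parameters(self):
        return [self.gamma, self.beta]
```

With a 32x4x68 input, the running statistics now have 68 entries (one per channel), matching what `nn.BatchNorm1d` does for (N, C, L) inputs.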
TRACEABILITY: USING DEBUG BREAKPOINTS TO VALIDATE SHAPES
The author demonstrates validating forward passes by inserting breakpoints and printing shapes across layers. This practice helps confirm that the embedding, flattening, and linear layers produce tensors of expected shapes as context size grows (e.g., 4x4x20 → 4x4x200). Shape tracing is essential when experimenting with new groupings and higher-dimensional inputs, ensuring numerical correctness and preventing silent shape-related bugs from derailing training.
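The pattern relies on each layer caching its output in `self.out`; a stripped-down illustration (the `Linear` and `Tanh` stand-ins here are simplified versions of the lecture's layers):

```python
import torch

class Linear:
    def __init__(self, fan_in, fan_out):
        self.weight = torch.randn(fan_in, fan_out)
    def __call__(self, x):
        self.out = x @ self.weight       # cache output for later inspection
        return self.out

class Tanh:
    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out

layers = [Linear(20, 200), Tanh()]
x = torch.randn(4, 4, 20)
for layer in layers:
    x = layer(x)
# After one forward pass, dump every intermediate shape:
for layer in layers:
    print(layer.__class__.__name__, ':', tuple(layer.out.shape))
```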
SCALING EMBEDDINGS AND DEPTH: PERFORMANCE GAINS
After stabilizing the basic architecture, the author explores scaling up. Increasing the embedding size to 24 (creating roughly 76k parameters) and maintaining a hierarchical structure yields a modest yet meaningful improvement: validation loss drops to about 1.993. The trade-off is longer training times and a need for careful hyperparameter tuning. The result confirms that deeper, more expressive representations can meaningfully improve performance even in a relatively simple WaveNet-inspired setup.
TRAINING DYNAMICS AND HYPERPARAMETER TUNING
The narrative emphasizes the lack of an experimental harness and the ad hoc nature of parameter tuning. While performance improves, the author notes that without systematic experiments (e.g., grid searches, learning-rate schedules, ablation studies), it's hard to assess statistical significance. The take-home is that training dynamics matter greatly, and gains can be fragile without robust evaluation protocols and repeatable experiments.
CONVOLUTIONAL ASPIRATIONS: DILATED CAUSAL CONVS AS A VEHICLE
Although not fully implementing gated convolutions, residuals, or skip connections, the talk previews how dilated causal convolutions can offer efficient, space- and time-saving computation for overlapping predictions. The idea is to slide small filters over the input sequence to reuse computations and to capture longer-range dependencies without duplicating effort. This section connects the current hierarchical fusion model to the broader WaveNet machinery and its practical benefits.
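As a taste of that machinery, a dilated causal convolution can be built from `nn.Conv1d` by left-padding the sequence (hyperparameters here are assumptions for illustration, not values from the video):

```python
import torch
import torch.nn as nn

# Dilated causal 1D convolution: left-padding by dilation*(kernel_size-1)
# ensures output at time t only sees inputs at times <= t.
dilation, kernel_size, channels = 2, 2, 16
conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
x = torch.randn(1, channels, 8)                        # (batch, channels, time)
x_padded = nn.functional.pad(x, (dilation * (kernel_size - 1), 0))
y = conv(x_padded)
print(y.shape)                                         # time length preserved: 8
```

Stacking such layers with doubling dilations (1, 2, 4, ...) grows the receptive field exponentially while reusing computation across overlapping positions, which is the efficiency the hierarchical fusion model approximates.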
FUTURE DIRECTIONS: EXPERIMENTS, RNNs, TRANSFORMERS
Finally, the speaker outlines future avenues: building a complete convolutional network with dilations, exploring residual/skip connections, and creating a rigorous experimental harness for systematic hyperparameter tuning. They also hint at revisiting RNNs, LSTMs, and Transformers as alternative architectures. The overarching message is that this is a learning journey—an incremental exploration toward more powerful and scalable sequence models with robust evaluation.
Context length experiments and validation loss (as reported in the episode)
| Configuration | Validation loss |
|---|---|
| Context length = 3 (initial MLP) | 2.10 |
| Context length = 8 (baseline expansion) | 2.02 |
| Embedding size = 24 with deeper net | 1.993 |
Common Questions
What is WaveNet, and why is it used here?
WaveNet is a neural network architecture that uses dilated causal convolutions for autoregressive sequence modeling, originally developed for audio generation. In this video it serves as a conceptual blueprint for building a deeper, progressively fused, tree-like context model for next-character prediction, illustrating the shift from a flat, single-layer approach to a hierarchical one.