Key Moments
Building makemore Part 5: Building a WaveNet
WaveNet-inspired hierarchical fusion for character prediction with PyTorch-like modules.
Key Insights
Increasing context length from 3 to 8 improves validation loss (2.10 -> ~2.02) and sample quality.
The project progresses from a flat MLP to modular building blocks (embedding, flatten, sequential) inspired by PyTorch.
A hierarchical fusion approach is introduced, progressively combining context (2 chars -> 4, 8, etc.) to form deeper representations.
BatchNorm behavior across multiple batch dimensions is subtle and bug-prone; fixing it improves stability and performance.
Using a Sequential-like container simplifies forward passes and keeps parameter management clean and scalable.
Larger embeddings and deeper/hierarchical networks yield further gains (e.g., ~1.99 validation) but require longer training and tuning.
EMBRACING A LONGER CONTEXT: FROM 3-CHAR TO 8-CHAR BLOCKS
The video describes a shift from predicting the next character using a tiny window of three characters to a broader, eight-character block. This change amplifies the input dimensionality and alters the first linear layer, increasing the parameter count by roughly 10k. The larger context allows the model to capture longer-range dependencies and paves the way for a WaveNet-like architecture that processes context in a more nuanced, staged fashion rather than in a single oversized leap.
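The dataset-construction change can be sketched as follows, using names (`block_size`, `build_dataset`, `stoi`) that follow the lecture's conventions; this is a minimal sketch, not the exact code from the video:

```python
import torch

# Context length: how many characters we use to predict the next one.
# The video bumps this from 3 to 8; everything else stays the same.
block_size = 8

def build_dataset(words, stoi):
    """Turn a list of words into (context, target) training pairs."""
    X, Y = [], []
    for w in words:
        context = [0] * block_size          # pad with the '.' token (index 0)
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]    # slide the window one character
    return torch.tensor(X), torch.tensor(Y)
```

Because only `block_size` changes, the same pipeline produces wider context rows; the first linear layer must then grow to accept the larger flattened input.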
INTRODUCING WAVENET-STYLE HIERARCHICAL FUSION
Inspired by WaveNet, the architecture begins fusing information progressively: two characters form a small unit, then two bigrams are fused into four-gram representations, and so on up the hierarchy. The goal is to create a deep, tree-like structure where information is fused gradually rather than crushed in a single hidden layer. This hierarchical fusion makes the network deeper in a meaningful way, aiding the model in utilizing context while maintaining computational efficiency via a dilated/convolutional mindset.
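The shape arithmetic behind this fusion can be illustrated on its own (the batch size of 4 and 10-dimensional embeddings here are assumptions for illustration):

```python
import torch

# Track how a batch of 8-character contexts is fused pairwise at each level.
x = torch.randn(4, 8, 10)            # (batch, characters, embedding)
for level in range(3):               # 8 -> 4 -> 2 -> 1 groups
    B, T, C = x.shape
    x = x.view(B, T // 2, C * 2)     # fuse each consecutive pair of vectors
    print(x.shape)                   # (4,4,20) -> (4,2,40) -> (4,1,80)
# In the real model, a Linear + BatchNorm + Tanh block sits between levels,
# so information is mixed (not just concatenated) at every fusion step.
```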
MODULAR LAYERS: EMBEDDING, FLATTENING, AND A PYTORCH-LINEAGE
A shift to modular blocks is described: an embedding module replaces a plain weight lookup, and a flatten module replaces ad-hoc reshaping. These modules imitate PyTorch's Embedding and Flatten concepts, enabling cleaner, reusable code. The embedding maps indices to vectors, and the flatten operation reorganizes the resulting 3D tensor into a form suitable for the next linear layer. This move toward PyTorch-like primitives makes the code easier to reason about and reuse in larger architectures.
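A minimal sketch of the two modules, in the spirit of the lecture's hand-rolled layers (details such as initialization are simplified here):

```python
import torch

class Embedding:
    """Lookup table mapping integer indices to vectors (mirrors nn.Embedding)."""
    def __init__(self, num_embeddings, embedding_dim):
        self.weight = torch.randn(num_embeddings, embedding_dim)
    def __call__(self, IX):
        self.out = self.weight[IX]        # fancy indexing does the lookup
        return self.out
    def parameters(self):
        return [self.weight]

class Flatten:
    """Collapse everything after the batch dimension (mirrors nn.Flatten)."""
    def __call__(self, x):
        self.out = x.view(x.shape[0], -1)
        return self.out
    def parameters(self):
        return []
```

With an 8-character context and 10-dimensional embeddings, `Embedding` turns a (4, 8) index tensor into (4, 8, 10), and `Flatten` reshapes that into (4, 80) for a plain linear layer.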
BUILDING A SEQUENTIAL CONTAINER TO SIMPLIFY FORWARD PASSES
To avoid scattered, ad-hoc forward passes, a Sequential-like container is introduced. This container stores a list of layers and forwards input through them in order, exposing a clean interface for parameter management and training. The model then becomes a single module that can be invoked as a whole. This mirrors PyTorch’s nn.Sequential, highlighting the value of container abstractions to reduce boilerplate and improve readability while preserving flexibility.
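The container itself is only a few lines; this sketch assumes each layer exposes `__call__` and `parameters()`, as in the modules above:

```python
class Sequential:
    """Container that calls its layers in order (mirrors nn.Sequential)."""
    def __init__(self, layers):
        self.layers = layers
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        self.out = x
        return self.out
    def parameters(self):
        # Collect parameters from all layers into one flat list.
        return [p for layer in self.layers for p in layer.parameters()]
```

Training code then reduces to `logits = model(Xb)` plus a single `model.parameters()` call for the optimizer loop.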
RETHINKING FLATTENING: FROM FLAT TO GROUPED CONCAT
A critical design shift is to stop flattening all context at once. Instead, the code now groups consecutive elements (e.g., pairs) and flattens only those groups, producing shapes such as 4x4x20 rather than 4x80. This enables the first linear layer to operate on smaller, grouped inputs (e.g., eight characters → four groups of two), laying the groundwork for hierarchical, parallel processing of multiple n-grams within each example.
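The grouped variant is a small generalization of `Flatten`, close in spirit to the lecture's `FlattenConsecutive`:

```python
import torch

class FlattenConsecutive:
    """Flatten groups of n consecutive vectors instead of the whole context."""
    def __init__(self, n):
        self.n = n
    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:              # drop a spurious time dimension of 1
            x = x.squeeze(1)
        self.out = x
        return self.out
    def parameters(self):
        return []
```

With `n=2`, a (4, 8, 10) embedding tensor becomes (4, 4, 20): four examples, four bigram groups, twenty features each, exactly the shape the hierarchical first layer expects.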
MATRIX MULTIPLICATION ACROSS HIGHER DIMENSIONS
The discussion emphasizes that PyTorch’s matmul supports higher-dimensional inputs with broadcasting semantics. By reshaping to 4x4x20, the model can multiply along the last dimension (20) while treating the preceding dimensions as batch dimensions. This allows efficient, parallel processing of multiple two-character groups within each example, enabling the neural net to scale context handling without exploding the number of parameters unnecessarily.
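A small demonstration of this broadcasting behavior (the hidden size of 200 is an assumption for illustration):

```python
import torch

# PyTorch matrix multiply broadcasts over leading dimensions: only the last
# dimension of the input must match the first dimension of the weight matrix.
x = torch.randn(4, 4, 20)     # (batch, groups, features): 4 bigram groups each
W = torch.randn(20, 200)      # weights of the first hidden layer
b = torch.randn(200)
h = x @ W + b                 # applied in parallel to every (batch, group) pair
print(h.shape)                # torch.Size([4, 4, 200])
```

The same 20-by-200 weight matrix is reused across all groups, which is why grouping the context does not multiply the parameter count.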
BUG FIXES AND DIFFICULTIES: BATCHNORM STATES AND MULTI-DIMENSIONAL INPUTS
A subtle but important issue arises with BatchNorm when inputs carry more than one batch-like dimension. With a 32x4x68 activation, reducing only over dimension 0 maintained four separate sets of running statistics (one per position) instead of a single set of per-channel statistics over the 68 channels. The fix is to reduce over dimensions 0 and 1 together, treating the (batch, time) pair jointly as the batch dimension, so the running statistics align with the intended channel-wise normalization. This stabilizes training as context widens and the network deepens.
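The corrected normalization can be sketched as below, following the lecture's hand-rolled BatchNorm (the key line is the `dim` selection for 3D inputs):

```python
import torch

class BatchNorm1d:
    """BatchNorm that treats every dimension except the last as 'batch'."""
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps, self.momentum, self.training = eps, momentum, True
        self.gamma, self.beta = torch.ones(dim), torch.zeros(dim)
        self.running_mean, self.running_var = torch.zeros(dim), torch.ones(dim)
    def __call__(self, x):
        if self.training:
            # The bug fix: for a 3D input, reduce over dims (0, 1), not just 0,
            # so statistics are per channel rather than per position.
            dim = 0 if x.ndim == 2 else (0, 1)
            xmean = x.mean(dim, keepdim=True)
            xvar = x.var(dim, keepdim=True)
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        else:
            xmean, xvar = self.running_mean, self.running_var
        self.out = self.gamma * (x - xmean) / torch.sqrt(xvar + self.eps) + self.beta
        return self.out
    def parameters(self):
        return [self.gamma, self.beta]
```

With a 32x4x68 input, the running statistics now have 68 entries (one per channel), matching what `nn.BatchNorm1d` does for (N, C, L) inputs.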
TRACEABILITY: USING DEBUG BREAKPOINTS TO VALIDATE SHAPES
The author demonstrates validating forward passes by inserting breakpoints and printing shapes across layers. This practice helps confirm that the embedding, flattening, and linear layers produce tensors of expected shapes as context size grows (e.g., 4x4x20 → 4x4x200). Shape tracing is essential when experimenting with new groupings and higher-dimensional inputs, ensuring numerical correctness and preventing silent shape-related bugs from derailing training.
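The pattern relies on each layer caching its output in `self.out`; a stripped-down illustration (the `Linear` and `Tanh` stand-ins here are simplified versions of the lecture's layers):

```python
import torch

class Linear:
    def __init__(self, fan_in, fan_out):
        self.weight = torch.randn(fan_in, fan_out)
    def __call__(self, x):
        self.out = x @ self.weight       # cache output for later inspection
        return self.out

class Tanh:
    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out

layers = [Linear(20, 200), Tanh()]
x = torch.randn(4, 4, 20)
for layer in layers:
    x = layer(x)
# After one forward pass, dump every intermediate shape:
for layer in layers:
    print(layer.__class__.__name__, ':', tuple(layer.out.shape))
```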
SCALING EMBEDDINGS AND DEPTH: PERFORMANCE GAINS
After stabilizing the basic architecture, the author explores scaling up. Increasing the embedding size to 24 (creating roughly 76k parameters) and maintaining a hierarchical structure yields a modest yet meaningful improvement: validation loss drops to about 1.993. The trade-off is longer training times and a need for careful hyperparameter tuning. The result confirms that deeper, more expressive representations can meaningfully improve performance even in a relatively simple WaveNet-inspired setup.
TRAINING DYNAMICS AND HYPERPARAMETER TUNING
The narrative emphasizes the lack of an experimental harness and the ad hoc nature of parameter tuning. While performance improves, the author notes that without systematic experiments (e.g., grid searches, learning-rate schedules, ablation studies), it's hard to assess statistical significance. The take-home is that training dynamics matter greatly, and gains can be fragile without robust evaluation protocols and repeatable experiments.
CONVOLUTIONAL ASPIRATIONS: DILATED CAUSAL CONVS AS A VEHICLE
Although not fully implementing gated convolutions, residuals, or skip connections, the talk previews how dilated causal convolutions can offer efficient, space- and time-saving computation for overlapping predictions. The idea is to slide small filters over the input sequence to reuse computations and to capture longer-range dependencies without duplicating effort. This section connects the current hierarchical fusion model to the broader WaveNet machinery and its practical benefits.
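As a taste of that machinery, a dilated causal convolution can be built from `nn.Conv1d` by left-padding the sequence (hyperparameters here are assumptions for illustration, not values from the video):

```python
import torch
import torch.nn as nn

# Dilated causal 1D convolution: left-padding by dilation*(kernel_size-1)
# ensures output at time t only sees inputs at times <= t.
dilation, kernel_size, channels = 2, 2, 16
conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
x = torch.randn(1, channels, 8)                        # (batch, channels, time)
x_padded = nn.functional.pad(x, (dilation * (kernel_size - 1), 0))
y = conv(x_padded)
print(y.shape)                                         # time length preserved: 8
```

Stacking such layers with doubling dilations (1, 2, 4, ...) grows the receptive field exponentially while reusing computation across overlapping positions, which is the efficiency the hierarchical fusion model approximates.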
FUTURE DIRECTIONS: EXPERIMENTS, RNNs, TRANSFORMERS
Finally, the speaker outlines future avenues: building a complete convolutional network with dilations, exploring residual/skip connections, and creating a rigorous experimental harness for systematic hyperparameter tuning. They also hint at revisiting RNNs, LSTMs, and Transformers as alternative architectures. The overarching message is that this is a learning journey—an incremental exploration toward more powerful and scalable sequence models with robust evaluation.
Context length experiments and validation loss (as reported in the episode)
| Configuration | Validation loss |
|---|---|
| Context length = 3 (initial MLP) | 2.10 |
| Context length = 8 (baseline expansion) | 2.02 |
| Embedding size = 24 with deeper net | 1.993 |
Common Questions
What is WaveNet, and why is it used here?
WaveNet is a neural network architecture that uses dilated causal convolutions for autoregressive sequence modeling, originally developed for audio generation. In this video it serves as a conceptual blueprint for building a deeper, progressively fused, tree-like context model for next-character prediction, illustrating the shift from a flat, single-layer approach to a hierarchical one.