Building makemore Part 3: Activations & Gradients, BatchNorm
Key Moments
Initialization quirks, activations, BatchNorm basics, and practical diagnostics for stable deep nets.
Key Insights
Poor initialization can produce wildly confident but wrong predictions at start (high initial loss); fix by adjusting output bias, scaling weights, and avoiding symmetric zero initialization.
Hidden activations can saturate (e.g., tanh in the hidden layer H), leading to vanishing gradients; monitoring activation distributions helps diagnose and mitigate this with careful scaling.
Batch normalization stabilizes training by normalizing activations to zero mean and unit variance per batch, plus learnable gain (gamma) and bias (beta); it also introduces running statistics for inference.
Practical initialization guidelines (fan-in scaling, gain factors from nonlinearities) plus modern techniques (ResNets, Adam) reduce reliance on perfect initialization; BN reduces sensitivity but adds coupling across batch data.
A suite of diagnostics (activation/gradient histograms, saturation metrics, update-to-data ratios) provides actionable feedback for tuning learning rate, BN placement, and network depth for robust training.
INITIALIZATION, LOGITS, AND EARLY TRAINING HICCUPS
The lecture begins by diagnosing why an otherwise simple multi-layer perceptron for character-level modeling starts training with an absurdly high loss. On the very first iteration the loss can spike to about 27, far above the expected cross-entropy baseline of ln(27) ≈ 3.29 for a uniform distribution over the 27 character classes. This happens because the initial logits are extreme, making the softmax overly confident in incorrect options. A four-character toy example illustrates the point: logits near zero produce a reasonable loss, while logits large in magnitude make the network spuriously confident about wrong choices, inflating the loss. The fix is a sequence of adjustments: zeroing the bias term in the final linear layer so the output logits start near zero, and scaling the weight matrix W2 down (e.g., by a factor of 0.1) to push the initial softmax toward a near-uniform distribution. This avoids the hockey-stick loss curve in which a few extreme logits dominate and the loss plummets only once they are tamed. The discussion also touches on symmetry breaking: why you typically don't set all weights to zero even though zero biases are fine. After these changes, initialization matches the theoretical baseline and the early loss decline is more gradual, so the first optimization cycles do productive work instead of fighting an overconfident, miscalibrated start.
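The effect described above is easy to reproduce. The sketch below (a minimal NumPy illustration, not the lecture's actual PyTorch code) computes cross-entropy for a single 27-way prediction: near-zero logits give a loss close to the uniform baseline ln(27), while logits of large magnitude can make the loss blow up when the model is confidently wrong.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_xent(logits, target):
    # numerically stable cross-entropy loss for one example
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[target])

vocab = 27
target = 5  # an arbitrary "correct" class

# near-zero logits: softmax is near-uniform, loss ≈ ln(27) ≈ 3.296
small = 0.01 * rng.standard_normal(vocab)
print(softmax_xent(small, target))

# extreme logits: confidently wrong predictions can inflate the loss badly
big = 10.0 * rng.standard_normal(vocab)
print(softmax_xent(big, target))
```

Scaling the final layer's weights down and zeroing its bias is exactly what moves the model from the second regime to the first at initialization.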
ACTIVATION SATURATION AND VANISHING GRADIENTS
The next major issue analyzed is activation saturation, especially for a tanh-activated hidden layer (H) fed by a large preactivation (H preact). The lecture shows histograms revealing that many activations sit near the saturation boundaries of tanh (close to ±1), while the preactivations span a wide range (-5 to 15). In the saturated regime, the tanh derivative (1 - t^2) becomes very small near ±1, causing gradients to vanish as they backpropagate. This can produce dead neurons: units that are saturated for every example in the batch, receive essentially no gradient, and never learn. The talk also notes that ReLU and ELU have different saturation behaviors, but the core problem remains: if activations consistently land in the flat region of the nonlinearity, gradients shrink and learning stalls. A practical remedy is to control the scale of the preactivations early on (e.g., by reducing the input-to-hidden weights), preserve some entropy to break symmetry, and avoid aggressive saturation across layers. The takeaway is that understanding and monitoring the distribution of H is crucial, especially when stacking multiple nonlinearities in deeper nets.
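The saturation diagnostic the lecture describes can be sketched in a few lines (a toy NumPy version; the layer sizes are illustrative, not the lecture's exact configuration). It measures the fraction of tanh outputs with |h| > 0.99, where the local gradient (1 - h**2) is below about 0.02, before and after shrinking the weights.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy hidden layer: too-large weights push tanh into its flat tails
x = rng.standard_normal((32, 30))   # batch of 32 examples, 30 inputs
W = rng.standard_normal((30, 200))  # deliberately unscaled weights
h = np.tanh(x @ W)

# fraction of activations in the saturated region |h| > 0.99
saturated = np.mean(np.abs(h) > 0.99)
print(f"saturated fraction: {saturated:.2f}")

# shrinking the weights keeps preactivations in tanh's responsive regime
h_small = np.tanh(x @ (W * 0.1))
print(f"after 0.1x scaling: {np.mean(np.abs(h_small) > 0.99):.2f}")
```

With unit-variance inputs and unscaled weights, the preactivation standard deviation is about sqrt(30) ≈ 5.5, so well over half of all activations land in the saturated tails; the 0.1x scaling drives that fraction to essentially zero.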
INITIALIZATION GUIDELINES AND PRACTICAL TIPS
To tame the scale of activations and ensure stable learning, the lecture revisits principled initialization strategies. A key idea is to scale weights by the inverse square root of the fan-in (1 / sqrt(fan_in)) so that the output variance remains roughly constant as signals propagate through layers. The talk references the Delving Deep into Rectifiers paper, which shows that the gain must be adjusted for different nonlinearities: ReLU benefits from a gain around sqrt(2); for tanh-like nonlinearities a gain around 5/3 is often used. PyTorch’s kaiming (He) initialization provides a practical implementation with a gain parameter that corresponds to the nonlinearity. The lecturer emphasizes that with modern architectures, perfect hand-tuning of initialization is less critical thanks to residual connections, normalization layers, and advanced optimizers (e.g., Adam). In their experiments, a careful but simple adjustment—scaling W1 to around 0.2 or 0.3 and keeping biases near zero at the outset—produces a much more reasonable training trajectory than naive random initialization. The key lesson is to anchor initialization in theory while leveraging practical tools to scale across larger networks.
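The fan-in scaling rule above can be verified directly. The helper below is a hedged sketch of Kaiming-style initialization (std = gain / sqrt(fan_in)); the layer sizes and the `kaiming_like` name are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

def kaiming_like(fan_in, fan_out, gain, rng):
    # fan-in scaled init: weight std = gain / sqrt(fan_in)
    return rng.standard_normal((fan_in, fan_out)) * gain / np.sqrt(fan_in)

x = rng.standard_normal((1000, 500))           # unit-variance inputs
W = kaiming_like(500, 500, gain=1.0, rng=rng)  # gain 1.0 suits a linear layer
y = x @ W
print(x.std(), y.std())  # both ≈ 1.0: variance is preserved through the layer

# for tanh the recommended gain is 5/3; for ReLU, sqrt(2)
W_tanh = kaiming_like(500, 500, gain=5/3, rng=rng)
```

PyTorch's `torch.nn.init.kaiming_normal_` implements the same idea, with the gain derived from the chosen nonlinearity.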
BATCH NORMALIZATION: HOW IT WORKS, PROS, CONS, AND PRACTICAL IMPLEMENTATION
Batch normalization (BN) is introduced as a transformative innovation for training deep nets. The core idea is to standardize the pre-activation inputs to each layer to zero mean and unit variance across the current batch, then apply a learnable affine transformation via gamma (gain) and beta (bias). This stabilizes activations and helps propagate gradients through many layers, enabling deeper architectures. In practice, training uses batch statistics while BN maintains running estimates of the mean and variance for inference. The lecture also explains the epsilon term that avoids division by zero and the momentum parameter for updating running statistics. It then discusses the trade-offs: BN couples different examples within a batch, which can introduce strange dynamics and bugs, but it often yields substantial stability and faster convergence. It also mentions BN's regularizing effects and how its drawbacks led to alternatives (LayerNorm, GroupNorm) and architectures (ResNets) that mitigate batch coupling. A practical note is to place BN after linear layers (and before nonlinearities) and to disable biases in the preceding linear layers, since BN provides its own bias via beta.
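The mechanics described above fit in a short class. This is a minimal NumPy sketch of the forward pass only (no autograd), mirroring the lecture's description of batch statistics, the epsilon term, the momentum-based running estimates, and the gamma/beta affine step; the default values are the common PyTorch ones.

```python
import numpy as np

class BatchNorm1d:
    """Minimal sketch of 1D batch norm (forward pass only)."""
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        self.gamma = np.ones(dim)   # learnable gain
        self.beta = np.zeros(dim)   # learnable bias
        self.running_mean = np.zeros(dim)  # used at inference time
        self.running_var = np.ones(dim)

    def __call__(self, x):
        if self.training:
            mean, var = x.mean(0), x.var(0)
            # exponential moving average of batch statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        xhat = (x - mean) / np.sqrt(var + self.eps)  # standardize per feature
        return self.gamma * xhat + self.beta

rng = np.random.default_rng(3)
bn = BatchNorm1d(100)
x = 5.0 + 3.0 * rng.standard_normal((64, 100))  # shifted, scaled inputs
out = bn(x)
print(out.mean(), out.std())  # ≈ 0 and ≈ 1 at init (gamma=1, beta=0)
```

Note how every output element depends on every example in the batch via `mean` and `var`: that coupling is exactly the source of the dynamics and bugs the lecture warns about.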
DIAGNOSTICS, PRACTICAL TAKEAWAYS, AND THE ROAD AHEAD
The final portion centers on diagnostic tools and practical takeaways for building stable networks. The lecturer demonstrates a set of diagnostics: activation histograms across layers to detect saturation, gradient histograms to ensure healthy backpropagation, and saturation percentages (e.g., the fraction of activations in the tails of the nonlinearity). They also introduce a gradient-to-data ratio and, more usefully, an update-to-data ratio that tracks how much each optimization step actually changes the weights. If updates overwhelm the parameters, the learning rate or normalization strategy needs adjustment. The talk showcases a modular, PyTorch-like approach ("torch-ified" code) with distinct linear, BN, and nonlinearity modules, enabling scalable architectures and easier experimentation. Finally, the discussion foreshadows more advanced topics such as recurrent neural networks, residual blocks, and Transformers, hinting that activation statistics and normalization become even more critical as networks deepen and specialize. The overall message is that robust training hinges on a toolkit of initialization, normalization, and diagnostic practices rather than ad-hoc tweaks.
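The update-to-data ratio mentioned above is simple to compute. The sketch below fakes a single training step on a toy parameter tensor (the values are illustrative, not from the lecture); the lecture's rough rule of thumb is that this ratio should sit around 1e-3 per step.

```python
import numpy as np

rng = np.random.default_rng(4)

# toy parameter tensor and a gradient from one (pretend) training step
W = rng.standard_normal((200, 100)) * 0.1
grad = rng.standard_normal((200, 100)) * 0.01
lr = 0.1

# update-to-data ratio: the size of this step's change relative to
# the magnitude of the weights themselves
update = lr * grad
ratio = update.std() / W.std()
print(f"update/data ratio: {ratio:.1e}")  # ≈ 1e-2 here: lr likely too high
```

In a real training loop you would log this ratio per parameter tensor per step (typically its log10) and watch for tensors drifting far above or below the healthy band.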
Effect of initialization fixes on validation loss (example)
Data extracted from this episode
| Condition | Validation Loss (approx) |
|---|---|
| Original (bad init, overconfident softmax) | 2.17 |
| Fix softmax scale (smaller W_out / zero bias) | 2.13 |
| Also fix hidden pre-activation scale (smaller W1/B1) | 2.10 |
Common Questions
Why is the initial loss so large? A large initial loss usually means the model outputs very confident but incorrect probabilities at initialization; check the logits distribution: logits should be near-equal, so the softmax is close to uniform. Rescale the output layer's biases/weights or use a principled init (std = gain / sqrt(fan_in)).
Mentioned in this video
nn.Linear: PyTorch linear (fully-connected) layer; discussed as the basic building block for MLPs, with fan_in/fan_out determining weight initialization scales.
ResNet: residual neural network architecture used as a real example of deep nets where conv -> BatchNorm -> ReLU motifs repeat; referenced to show typical placement of BatchNorm.
nn.BatchNorm1d: PyTorch 1D batch normalization layer; explained in detail (per-batch mean/variance calculation, learned scale (gamma) and shift (beta), running stats for inference).
PyTorch: the deep learning framework used throughout; the instructor demonstrates how the hand-built modules (Linear, BatchNorm1d, activations) map to PyTorch APIs.
torch.no_grad: PyTorch decorator/context manager used to skip building the autograd graph for a block of code (used during evaluation and running-stat updates).
Kaiming He: author of 'Delving Deep into Rectifiers' and contributor to the initialization and ResNet work discussed when explaining recommended initialization gains.
torch.nn.init.kaiming_normal_: PyTorch initializer (Kaiming/He initialization) for weights; described as a common way to set weight std = gain / sqrt(fan_in), with the gain depending on the nonlinearity.
'Delving Deep into Rectifiers' (He et al.): paper that analyzed initialization for ReLU-like nonlinearities and derived recommended gains (e.g., sqrt(2)) to preserve activation variance.
SGD: stochastic gradient descent, the baseline optimizer used in the examples; compared to Adam/RMSprop when discussing practical stability.