Building makemore Part 3: Activations & Gradients, BatchNorm
Key Moments
Initialization quirks, activations, BatchNorm basics, and practical diagnostics for stable deep nets.
Key Insights
Poor initialization can produce wildly confident but wrong predictions at start (high initial loss); fix by adjusting output bias, scaling weights, and avoiding symmetric zero initialization.
Hidden activations can saturate (e.g., tanh in the hidden layer H), leading to vanishing gradients; monitoring activation distributions helps diagnose and mitigate this with careful scaling.
Batch normalization stabilizes training by normalizing activations to zero mean and unit variance per batch, plus learnable gain (gamma) and bias (beta); it also introduces running statistics for inference.
Practical initialization guidelines (fan-in scaling, gain factors from nonlinearities) plus modern techniques (ResNets, Adam) reduce reliance on perfect initialization; BN reduces sensitivity but adds coupling across batch data.
A suite of diagnostics (activation/gradient histograms, saturation metrics, update-to-data ratios) provides actionable feedback for tuning learning rate, BN placement, and network depth for robust training.
INITIALIZATION, LOGITS, AND EARLY TRAINING HICCUPS
The lecture begins by diagnosing why an otherwise simple multi-layer perceptron for character-level modeling starts training with an absurdly high loss. On the very first iteration the loss can spike to about 27, far above the expected cross-entropy baseline of ln(27) ≈ 3.29 for a uniform distribution over the 27 character classes. This happens because the initial logits are extreme, making the softmax overly confident in incorrect options. A four-character toy example illustrates the point: logits near zero produce a reasonable loss, while logits large in magnitude make the network spuriously confident about wrong choices, inflating the loss. The fix is a sequence of adjustments: zeroing the bias term in the final linear layer so the output logits start near zero, and scaling the weight matrix W2 down (e.g., by a factor of 0.1) to push the initial softmax toward a near-uniform distribution. This avoids the hockey-stick loss curve in which a few extreme logits dominate and the loss plummets only once they are tamed. The discussion also touches on symmetry breaking: why you typically don't set all weights to zero even though zero biases are fine. After these changes, initialization matches the theoretical baseline and the early loss decline is more gradual, so the first optimization cycles do productive work instead of fighting an overconfident, miscalibrated start.
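The effect described above is easy to reproduce. The sketch below (a minimal NumPy illustration, not the lecture's actual PyTorch code) computes cross-entropy for a single 27-way prediction: near-zero logits give a loss close to the uniform baseline ln(27), while logits of large magnitude can make the loss blow up when the model is confidently wrong.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_xent(logits, target):
    # numerically stable cross-entropy loss for one example
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[target])

vocab = 27
target = 5  # an arbitrary "correct" class

# near-zero logits: softmax is near-uniform, loss ≈ ln(27) ≈ 3.296
small = 0.01 * rng.standard_normal(vocab)
print(softmax_xent(small, target))

# extreme logits: confidently wrong predictions can inflate the loss badly
big = 10.0 * rng.standard_normal(vocab)
print(softmax_xent(big, target))
```

Scaling the final layer's weights down and zeroing its bias is exactly what moves the model from the second regime to the first at initialization.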
ACTIVATION SATURATION AND VANISHING GRADIENTS
The next major issue analyzed is activation saturation, especially for a tanh-activated hidden layer (H) fed by a large preactivation (H preact). The lecture shows histograms revealing that many activations sit near the saturation boundaries of tanh (close to ±1), while the preactivations span a wide range (-5 to 15). In the saturated regime, the tanh derivative (1 - t^2) becomes very small near ±1, causing gradients to vanish as they backpropagate. This can produce dead neurons: units that are saturated for every example in the batch, receive essentially no gradient, and never learn. The talk also notes that ReLU and ELU have different saturation behaviors, but the core problem remains: if activations consistently land in the flat region of the nonlinearity, gradients shrink and learning stalls. A practical remedy is to control the scale of the preactivations early on (e.g., by reducing the input-to-hidden weights), preserve some entropy to break symmetry, and avoid aggressive saturation across layers. The takeaway is that understanding and monitoring the distribution of H is crucial, especially when stacking multiple nonlinearities in deeper nets.
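The saturation diagnostic the lecture describes can be sketched in a few lines (a toy NumPy version; the layer sizes are illustrative, not the lecture's exact configuration). It measures the fraction of tanh outputs with |h| > 0.99, where the local gradient (1 - h**2) is below about 0.02, before and after shrinking the weights.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy hidden layer: too-large weights push tanh into its flat tails
x = rng.standard_normal((32, 30))   # batch of 32 examples, 30 inputs
W = rng.standard_normal((30, 200))  # deliberately unscaled weights
h = np.tanh(x @ W)

# fraction of activations in the saturated region |h| > 0.99
saturated = np.mean(np.abs(h) > 0.99)
print(f"saturated fraction: {saturated:.2f}")

# shrinking the weights keeps preactivations in tanh's responsive regime
h_small = np.tanh(x @ (W * 0.1))
print(f"after 0.1x scaling: {np.mean(np.abs(h_small) > 0.99):.2f}")
```

With unit-variance inputs and unscaled weights, the preactivation standard deviation is about sqrt(30) ≈ 5.5, so well over half of all activations land in the saturated tails; the 0.1x scaling drives that fraction to essentially zero.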
INITIALIZATION GUIDELINES AND PRACTICAL TIPS
To tame the scale of activations and ensure stable learning, the lecture revisits principled initialization strategies. A key idea is to scale weights by the inverse square root of the fan-in (1 / sqrt(fan_in)) so that the output variance remains roughly constant as signals propagate through layers. The talk references the Delving Deep into Rectifiers paper, which shows that the gain must be adjusted for different nonlinearities: ReLU benefits from a gain around sqrt(2); for tanh-like nonlinearities a gain around 5/3 is often used. PyTorch’s kaiming (He) initialization provides a practical implementation with a gain parameter that corresponds to the nonlinearity. The lecturer emphasizes that with modern architectures, perfect hand-tuning of initialization is less critical thanks to residual connections, normalization layers, and advanced optimizers (e.g., Adam). In their experiments, a careful but simple adjustment—scaling W1 to around 0.2 or 0.3 and keeping biases near zero at the outset—produces a much more reasonable training trajectory than naive random initialization. The key lesson is to anchor initialization in theory while leveraging practical tools to scale across larger networks.
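The fan-in scaling rule above can be verified directly. The helper below is a hedged sketch of Kaiming-style initialization (std = gain / sqrt(fan_in)); the layer sizes and the `kaiming_like` name are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

def kaiming_like(fan_in, fan_out, gain, rng):
    # fan-in scaled init: weight std = gain / sqrt(fan_in)
    return rng.standard_normal((fan_in, fan_out)) * gain / np.sqrt(fan_in)

x = rng.standard_normal((1000, 500))           # unit-variance inputs
W = kaiming_like(500, 500, gain=1.0, rng=rng)  # gain 1.0 suits a linear layer
y = x @ W
print(x.std(), y.std())  # both ≈ 1.0: variance is preserved through the layer

# for tanh the recommended gain is 5/3; for ReLU, sqrt(2)
W_tanh = kaiming_like(500, 500, gain=5/3, rng=rng)
```

PyTorch's `torch.nn.init.kaiming_normal_` implements the same idea, with the gain derived from the chosen nonlinearity.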
BATCH NORMALIZATION: HOW IT WORKS, PROS, CONS, AND PRACTICAL IMPLEMENTATION
Batch normalization (BN) is introduced as a transformative innovation for training deep nets. The core idea is to standardize the pre-activation inputs to each layer to zero mean and unit variance across the current batch, then apply a learnable affine transformation via gamma (gain) and beta (bias). This stabilizes activations and helps propagate gradients through many layers, enabling deeper architectures. In practice, training uses batch statistics while BN maintains running estimates of the mean and variance for inference. The lecture also explains the epsilon term that avoids division by zero and the momentum parameter for updating running statistics. It then discusses the trade-offs: BN couples different examples within a batch, which can introduce strange dynamics and bugs, but it often yields substantial stability and faster convergence. It also mentions BN's regularizing effects and how its drawbacks led to alternatives (LayerNorm, GroupNorm) and architectures (ResNets) that mitigate batch coupling. A practical note is to place BN after linear layers (and before nonlinearities) and to disable biases in the preceding linear layers, since BN provides its own bias via beta.
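The mechanics described above fit in a short class. This is a minimal NumPy sketch of the forward pass only (no autograd), mirroring the lecture's description of batch statistics, the epsilon term, the momentum-based running estimates, and the gamma/beta affine step; the default values are the common PyTorch ones.

```python
import numpy as np

class BatchNorm1d:
    """Minimal sketch of 1D batch norm (forward pass only)."""
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        self.gamma = np.ones(dim)   # learnable gain
        self.beta = np.zeros(dim)   # learnable bias
        self.running_mean = np.zeros(dim)  # used at inference time
        self.running_var = np.ones(dim)

    def __call__(self, x):
        if self.training:
            mean, var = x.mean(0), x.var(0)
            # exponential moving average of batch statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        xhat = (x - mean) / np.sqrt(var + self.eps)  # standardize per feature
        return self.gamma * xhat + self.beta

rng = np.random.default_rng(3)
bn = BatchNorm1d(100)
x = 5.0 + 3.0 * rng.standard_normal((64, 100))  # shifted, scaled inputs
out = bn(x)
print(out.mean(), out.std())  # ≈ 0 and ≈ 1 at init (gamma=1, beta=0)
```

Note how every output element depends on every example in the batch via `mean` and `var`: that coupling is exactly the source of the dynamics and bugs the lecture warns about.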
DIAGNOSTICS, PRACTICAL TAKEAWAYS, AND THE ROAD AHEAD
The final portion centers on diagnostic tools and practical takeaways for building stable networks. The lecturer demonstrates a set of diagnostics: activation histograms across layers to detect saturation, gradient histograms to ensure healthy backpropagation, and saturation percentages (e.g., the fraction of activations in the tails of the nonlinearity). They also introduce a gradient-to-data ratio and, more usefully, an update-to-data ratio that tracks how much each optimization step actually changes the weights. If updates overwhelm the parameters, the learning rate or normalization strategy needs adjustment. The talk showcases a modular, PyTorch-like approach ("torch-ified" code) with distinct linear, BN, and nonlinearity modules, enabling scalable architectures and easier experimentation. Finally, the discussion foreshadows more advanced topics such as recurrent neural networks, residual blocks, and Transformers, hinting that activation statistics and normalization become even more critical as networks deepen and specialize. The overall message is that robust training hinges on a toolkit of initialization, normalization, and diagnostic practices rather than ad-hoc tweaks.
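The update-to-data ratio mentioned above is simple to compute. The sketch below fakes a single training step on a toy parameter tensor (the values are illustrative, not from the lecture); the lecture's rough rule of thumb is that this ratio should sit around 1e-3 per step.

```python
import numpy as np

rng = np.random.default_rng(4)

# toy parameter tensor and a gradient from one (pretend) training step
W = rng.standard_normal((200, 100)) * 0.1
grad = rng.standard_normal((200, 100)) * 0.01
lr = 0.1

# update-to-data ratio: the size of this step's change relative to
# the magnitude of the weights themselves
update = lr * grad
ratio = update.std() / W.std()
print(f"update/data ratio: {ratio:.1e}")  # ≈ 1e-2 here: lr likely too high
```

In a real training loop you would log this ratio per parameter tensor per step (typically its log10) and watch for tensors drifting far above or below the healthy band.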
Effect of initialization fixes on validation loss (example)
Data extracted from this episode
| Condition | Validation Loss (approx) |
|---|---|
| Original (bad init, overconfident softmax) | 2.17 |
| Fix softmax scale (smaller W_out / zero bias) | 2.13 |
| Also fix hidden pre-activation scale (smaller W1/B1) | 2.10 |
Common Questions
Why is the initial loss so large? A large initial loss usually means the model outputs very confident but incorrect probabilities at initialization; check the logits distribution: logits should be near-equal, so the softmax is close to uniform. Rescale the output layer's biases/weights or use a principled init (std = gain / sqrt(fan_in)).
Mentioned in this video
nn.Linear: PyTorch linear (fully-connected) layer; discussed as the basic building block for MLPs, with fan_in/fan_out determining weight initialization scales.
ResNet: residual neural network architecture used as a real example of deep nets where conv -> BatchNorm -> ReLU motifs repeat; referenced to show typical placement of BatchNorm.
nn.BatchNorm1d: PyTorch 1D batch normalization layer; explained in detail (per-batch mean/variance calculation, learned scale (gamma) and shift (beta), running stats for inference).
PyTorch: the deep learning framework used throughout; the instructor demonstrates how the hand-built modules (Linear, BatchNorm1d, activations) map to PyTorch APIs.
torch.no_grad: PyTorch decorator/context manager used to skip building the autograd graph for a block of code (used during evaluation and running-stat updates).
Kaiming He: author of 'Delving Deep into Rectifiers' and contributor to the initialization and ResNet work discussed when explaining recommended initialization gains.
torch.nn.init.kaiming_normal_: PyTorch initializer (Kaiming/He initialization) for weights; described as a common way to set weight std = gain / sqrt(fan_in), with the gain depending on the nonlinearity.
'Delving Deep into Rectifiers' (He et al.): paper that analyzed initialization for ReLU-like nonlinearities and derived recommended gains (e.g., sqrt(2)) to preserve activation variance.
SGD: stochastic gradient descent, the baseline optimizer used in the examples; compared to Adam/RMSprop when discussing practical stability.