Building makemore Part 4: Becoming a Backprop Ninja
Key Moments
Manual backprop through a 2-layer MLP; replace autograd, derive gradients by hand.
Key Insights
Backprop is a leaky abstraction: autograd hides internals, but understanding the exact gradient flow is crucial for debugging and robustness.
Historical context matters: before autograd, researchers wrote backward passes by hand; modern practice often glosses over details that matter in edge cases.
Gradients through complex ops (softmax, cross-entropy, batch norm) reveal subtle pitfalls like broadcasting, shape mishaps, and numerical stability tricks.
A structured workflow helps: compute and verify gradients step by step against PyTorch grads, using gradient checks and careful shape-handling.
Analytical shortcuts can greatly speed up training: deriving a compact gradient for logits (cross-entropy) can replace long chains of backprop through many atoms.
Batch normalization adds layers of complexity (mean/variance) and requires careful handling of broadcasting, epsilon, and bias terms; the discussion of Bessel's correction highlights real-world choices.
LEAKY ABSTRACTION: WHY HAND-CRAFT BACKPROP MATTERS
The lecture centers on removing PyTorch’s automatic gradient engine for a two-layer MLP with batch normalization, in order to write the backward pass by hand. The presenter argues that backprop is a leaky abstraction: while autograd makes training feel magical, it hides the actual gradient flow and can obscure subtle pitfalls like gradient saturation, dead neurons, and exploding/vanishing gradients. By tracing gradients through tensors rather than scalars, one gains the debugging clarity needed to fix bugs, reason about numerical stability, and build intuition for how signals propagate from loss back to parameters. The talk also situates this practice historically, noting that early DL work required manual backward passes, and that even now, understanding the internals aids in debugging, tuning, and preventing silent failures when gradients look superficially correct but behave poorly in practice.
REBUILDING THE BACKPROP ENGINE: SETUP AND GOALS
The notebook mirrors a familiar setup: a two-layer multilayer perceptron with a batch normalization layer, forward pass unchanged, but backward pass rewritten from scratch. The goal is to compute and compare manual gradients with PyTorch’s autograd, ensuring exact or near-exact agreement. To promote error detection, the author initializes weights with small random values (to reveal gradient bugs) and retains intermediate tensors to backprop through each step. The plan unfolds in progressive exercises: first reconstruct the full graph’s gradients, then derive a closed-form gradient for logits, then derive batch norm gradients analytically, and finally assemble everything into a complete hand-rolled training loop.
EXERCISE 1: RECONSTRUCTING THE GRADIENT THROUGH THE FULL GRAPH
In the first exercise, the learner derives the derivative of the loss with respect to every element of the log-probability tensor (log_probs). Its 32×27 shape corresponds to a batch of 32 examples and 27 possible next characters. The gradient with respect to log_probs is nonzero only where the correct class was selected; the gradient pattern is negative one over the batch size at those positions, and zero elsewhere. The exercise includes a rigorous check: compare the hand-computed gradient against PyTorch's grad using a gradient-checker utility and ensure exact or near-exact equality. This establishes confidence in the hand-rolled chain rule across a nontrivial graph.
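The pattern described above can be sketched in a few lines. This is a minimal stand-in for the lecture's setup, assuming random logits and targets rather than the actual character model:

```python
import torch

n, vocab = 32, 27  # batch size and character count, as in the lecture
torch.manual_seed(0)
logits = torch.randn(n, vocab, requires_grad=True)
Yb = torch.randint(0, vocab, (n,))  # hypothetical target indices

log_probs = torch.log_softmax(logits, dim=1)
log_probs.retain_grad()  # keep the gradient on this intermediate tensor
loss = -log_probs[range(n), Yb].mean()  # cross-entropy via picked log-probs
loss.backward()

# Manual gradient: -1/n at the picked positions, zero elsewhere.
dlog_probs = torch.zeros_like(log_probs)
dlog_probs[range(n), Yb] = -1.0 / n

print(torch.allclose(dlog_probs, log_probs.grad))
```

`retain_grad` is needed because `log_probs` is an intermediate (non-leaf) tensor, whose gradient PyTorch would otherwise discard.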
EXERCISE 2: DERIVING GRADIENTS FOR LOGITS WITH CROSS-ENTROPY
Moving from the atom-by-atom backprop, the learner derives an analytic gradient of the loss with respect to the logits, bypassing the long chain of sub-operations. By focusing on a single example and using the softmax formula, the slide derives a compact expression for dL/dlogits: the gradient is the softmax probabilities, with a subtract-one offset at the correct class, all averaged across the batch. Implementing this yields a clean, efficient, and numerically stable backprop step, matching PyTorch up to tiny floating-point differences and providing a strong intuition for how cross-entropy drives learning.
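The compact expression can be checked directly against autograd. A minimal sketch, with hypothetical random logits and targets:

```python
import torch
import torch.nn.functional as F

n, vocab = 32, 27
torch.manual_seed(1)
logits = torch.randn(n, vocab, requires_grad=True)
Yb = torch.randint(0, vocab, (n,))

loss = F.cross_entropy(logits, Yb)  # reduction defaults to the batch mean
loss.backward()

# Analytic shortcut: softmax probabilities, minus one at the correct class,
# averaged over the batch.
dlogits = F.softmax(logits.detach(), dim=1)
dlogits[range(n), Yb] -= 1
dlogits /= n

print(torch.allclose(dlogits, logits.grad, atol=1e-6))
```

The agreement is only up to float32 round-off, since `F.cross_entropy` fuses log-softmax and the negative-log-likelihood internally.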
CROSS-ENTROPY INTUITION: FORCES OF PULL AND PULLBACK
The narrative deeply explores the intuition behind the cross-entropy gradient. In a row of logits, the gradient signals a pull away from incorrect options and a push toward the correct one, scaled by the softmax probabilities. The probability row sums to one, so the total gradient in each row sums to zero, preserving a balance of forces. This mental model — a network of “pulls” and “pushes” on logits — helps explain why confident mispredictions produce large gradient updates while near-correct predictions yield smaller adjustments. It also clarifies why the gradient is distributed across the 27 options in a row.
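A toy row makes the zero-sum balance concrete. Here a three-way classifier (standing in for the 27-way one) is confidently wrong, and the resulting gradient pulls hard on the correct class while pushing down the overconfident one:

```python
import torch
import torch.nn.functional as F

# One row of 3 logits; the correct class is index 0 (toy stand-in for 27 chars).
logits = torch.tensor([[-2.0, 3.0, 0.5]])  # confidently wrong: class 1 dominates
p = F.softmax(logits, dim=1)

dlogits = p.clone()
dlogits[0, 0] -= 1  # probability minus one at the correct class

print(dlogits)        # negative "pull" on the correct class, positive "push" elsewhere
print(dlogits.sum())  # each row sums to ~0: the forces balance
```

Because the probabilities in a row sum to one, subtracting one at the target leaves a row that sums to zero, exactly the balance of forces described above.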
EXERCISE 3: HAND-CRAFTING BATCH-NORM BACKWARD PASS
Batch normalization introduces additional complexity due to mean and variance, running statistics, and broadcasting. Here the backward pass focuses on the input gradient through the BN transform, while acknowledging that many tutorials leave BN's full chain complicated. The discussion covers the subtle Bessel's correction (bias correction) used when estimating variance, the role of epsilon for numerical stability, and the impact of training vs inference modes. The implementation derives a compact, correct expression for the BN backward pass, including how gradients must be accumulated across the batch and broadcast to each feature dimension.
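The compact expression from the lecture can be verified numerically. This sketch assumes hypothetical shapes (batch of 32, 64 hidden units) and uses the Bessel-corrected variance (dividing by n−1), which is why the n/(n−1) factor appears in the backward formula:

```python
import torch

torch.manual_seed(2)
n, d = 32, 64  # batch size and hidden width (hypothetical)
hprebn = torch.randn(n, d, requires_grad=True)
bngain = torch.randn(1, d)
bnbias = torch.randn(1, d)

# Forward: batch norm with Bessel-corrected variance, as in the lecture.
bnmean = hprebn.mean(0, keepdim=True)
bnvar = hprebn.var(0, keepdim=True, unbiased=True)  # divides by n-1
bnvar_inv = (bnvar + 1e-5) ** -0.5                  # epsilon for stability
bnraw = (hprebn - bnmean) * bnvar_inv
hpreact = bngain * bnraw + bnbias

# Pretend some downstream gradient arrived at the BN output.
dhpreact = torch.randn_like(hpreact)
hpreact.backward(dhpreact)

# Compact analytic backward through the BN transform.
dhprebn = bngain * bnvar_inv / n * (
    n * dhpreact
    - dhpreact.sum(0)
    - n / (n - 1) * bnraw * (dhpreact * bnraw).sum(0)
)

print(torch.allclose(dhprebn, hprebn.grad, atol=1e-5))
```

Note how the two `sum(0)` terms accumulate over the batch and then broadcast back across rows — exactly the accumulation-and-broadcast pattern the derivation has to get right.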
EXERCISE 4: INTEGRATING AND VALIDATING HAND-CRAFTED GRADIENTS
The final exercise integrates all hand-derived pieces: the backward through the entire network, the cross-entropy loss, the batch-norm, and the embedding/linear layers. The learner switches to a no_grad context and compares the hand-derived gradient updates against PyTorch’s autograd, iterating until the gradients converge to a tiny difference. The practitioner demonstrates that the same optimization dynamics can be achieved with hand-made gradients, reinforcing confidence in understanding and exposing the internal mechanics that PyTorch abstracts away.
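A small comparison helper, in the spirit of the lecture's `cmp` utility, makes this iterative validation routine. The toy usage below (gradient of a sum of squares) is an illustrative stand-in, not the lecture's network:

```python
import torch

def cmp(name, dt, t):
    """Compare a hand-derived gradient dt against t.grad from autograd."""
    exact = torch.all(dt == t.grad).item()          # bitwise-identical?
    approx = torch.allclose(dt, t.grad)             # equal up to float error?
    maxdiff = (dt - t.grad).abs().max().item()
    print(f'{name:15s} | exact: {exact} | approx: {approx} | maxdiff: {maxdiff:.2e}')
    return approx

# Toy usage: the gradient of (x**2).sum() with respect to x is 2x.
x = torch.randn(5, requires_grad=True)
(x ** 2).sum().backward()
ok = cmp('x', 2 * x, x)
```

Reporting exact, approximate, and maximum-difference columns separately is useful: analytic shortcuts (like the fused cross-entropy gradient) often match only approximately, while element-by-element backprop can match bitwise.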
HISTORICAL CONTEXT AND PRACTICAL TAKEAWAYS
A closing thread reflects on history and practical lessons: 10+ years ago, manual backward passes were ubiquitous; today, autograd dominates, but the exercise reveals why understanding gradients matters for debugging, correctness, and optimization. The talk foreshadows moving toward recurrent nets (LSTMs) and more complex architectures, building a solid foundation for understanding gradient flow across diverse layers. The key takeaway is not to shun autograd, but to internalize how gradients flow, how to diagnose subtle issues, and how to design numerically stable and efficient backward passes when needed.
Common Questions
Why derive the backward pass by hand when autograd exists? The instructor argues manual backprop is a valuable exercise: it reveals how gradients flow, helps debug subtle bugs (e.g., clipped losses or broadcasting errors), and builds intuition that autograd can hide. (timestamp: 38)
Topics
Mentioned in this video
The built-in automatic differentiation mechanism in PyTorch; the lecturer argues for replacing calls to loss.backward with a manual tensor-level backward pass for pedagogical and debugging reasons.
Analytically derived gradient of cross-entropy w.r.t. logits: p - 1_{y} (scaled by 1/N for batch).
The standard backward relationships for D = A @ B + C (dA = dD @ B.T, dB = A.T @ dD, dC = sum(dD)), derived by hand from small examples.
The activation function used in the MLP; its backward rule (1 - a^2) is derived and used in the manual backward pass.
Utility PyTorch functions used in gradient-checking and to initialize derivative tensors in manual backprop.
The mechanism for turning integer token indices into embedded vectors; backward involves scattering and summing gradients into the embedding matrix rows used.
Used as an implementation trick to route gradients when scattering back from max/indices operations.
Subtracting per-row max from logits before exponentiation to avoid overflow; its backward contribution is small/near-zero and discussed in detail.
2006 work by Hinton and collaborators was cited as an example of research where gradients and updates were explicitly implemented by hand.
Statistical correction discussed in the context of batch normalization variance estimation (unbiased estimator dividing by n-1 vs biased n).
A training method (used historically for RBMs) that was mentioned to illustrate earlier practices of computing/using gradients directly instead of autograd.
The instructor's 2014 paper on aligning image fragments and text fragments; used as an example where the backward pass was implemented manually.
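The linear-layer and tanh rules listed above can be verified in a few lines. Shapes here are hypothetical; the point is that the hand-derived expressions match autograd:

```python
import torch

torch.manual_seed(3)
A = torch.randn(4, 5, requires_grad=True)
B = torch.randn(5, 3, requires_grad=True)
C = torch.randn(3, requires_grad=True)  # bias, broadcast across rows

D = torch.tanh(A @ B + C)
dD_up = torch.randn_like(D)   # some upstream gradient
D.backward(dD_up)

# tanh backward: local derivative is (1 - a^2), where a = tanh(pre-activation).
dpre = (1 - D ** 2) * dD_up
# Linear backward rules: dA = dD @ B.T, dB = A.T @ dD, dC = dD.sum(0).
dA = dpre @ B.T
dB = A.T @ dpre
dC = dpre.sum(0)  # bias was broadcast forward, so its gradient sums over the batch

for manual, t in [(dA, A), (dB, B), (dC, C)]:
    print(torch.allclose(manual, t.grad, atol=1e-6))
```

The bias rule is the mirror image of forward broadcasting: replication on the way forward becomes summation on the way back.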
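The max-subtraction trick mentioned above is easy to demonstrate: softmax is invariant to shifting a row by a constant, but the naive form overflows for large logits. A minimal sketch with deliberately huge values:

```python
import torch

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])  # large enough to overflow exp()

# Naive softmax: exp(1000) is inf in float32, and inf/inf gives nan.
naive = torch.exp(logits) / torch.exp(logits).sum(1, keepdim=True)

# Stable softmax: subtract the per-row max before exponentiating.
shifted = torch.exp(logits - logits.max(1, keepdim=True).values)
stable = shifted / shifted.sum(1, keepdim=True)

print(naive)   # nan everywhere
print(stable)  # well-defined probabilities, unchanged by the shift
```

Because the shift cancels in the softmax ratio, it changes nothing mathematically; its backward contribution correspondingly nets out, which is why the lecture finds it small/near-zero.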