Building makemore Part 4: Becoming a Backprop Ninja
Key Moments
Manual backprop through a 2-layer MLP; replace autograd, derive gradients by hand.
Key Insights
Backprop is a leaky abstraction: autograd hides internals, but understanding the exact gradient flow is crucial for debugging and robustness.
Historical context matters: before autograd, researchers wrote backward passes by hand; modern practice often glosses over details that matter in edge cases.
Gradients through complex ops (softmax, cross-entropy, batch norm) reveal subtle pitfalls like broadcasting, shape mishaps, and numerical stability tricks.
A structured workflow helps: compute and verify gradients step by step against PyTorch grads, using gradient checks and careful shape-handling.
Analytical shortcuts can greatly speed up training: deriving a compact gradient for logits (cross-entropy) can replace long chains of backprop through many atoms.
Batch normalization adds layers of complexity (mean/variance) and requires careful handling of broadcasting, epsilon, and bias terms; the discussion of Bessel's correction highlights real-world choices.
LEAKY ABSTRACTION: WHY HAND-CRAFT BACKPROP MATTERS
The lecture centers on removing PyTorch’s automatic gradient engine for a two-layer MLP with batch normalization, in order to write the backward pass by hand. The presenter argues that backprop is a leaky abstraction: while autograd makes training feel magical, it hides the actual gradient flow and can obscure subtle pitfalls like gradient saturation, dead neurons, and exploding/vanishing gradients. By tracing gradients through tensors rather than scalars, one gains the debugging clarity needed to fix bugs, reason about numerical stability, and build intuition for how signals propagate from loss back to parameters. The talk also situates this practice historically, noting that early DL work required manual backward passes, and that even now, understanding the internals aids in debugging, tuning, and preventing silent failures when gradients look superficially correct but behave poorly in practice.
REBUILDING THE BACKPROP ENGINE: SETUP AND GOALS
The notebook mirrors a familiar setup: a two-layer multilayer perceptron with a batch normalization layer, forward pass unchanged, but backward pass rewritten from scratch. The goal is to compute and compare manual gradients with PyTorch’s autograd, ensuring exact or near-exact agreement. To promote error detection, the author initializes weights with small random values (to reveal gradient bugs) and retains intermediate tensors to backprop through each step. The plan unfolds in progressive exercises: first reconstruct the full graph’s gradients, then derive a closed-form gradient for logits, then derive batch norm gradients analytically, and finally assemble everything into a complete hand-rolled training loop.
EXERCISE 1: RECONSTRUCTING THE GRADIENT THROUGH THE FULL GRAPH
In the first exercise, the learner derives the derivative of the loss with respect to every element of the log-probability tensor (log_probs). Its 32×27 shape corresponds to a batch of 32 examples and 27 possible next characters. The gradient with respect to log_probs is nonzero only where the correct class was selected; the gradient pattern is negative one over the batch size at those positions, and zero elsewhere. The exercise includes a rigorous check: compare the hand-computed gradient against PyTorch's grad using a gradient-checker utility and ensure exact or near-exact equality. This establishes confidence in the hand-rolled chain rule across a nontrivial graph.
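The pattern described above can be sketched in a few lines. This is a minimal stand-in for the lecture's setup, assuming random logits and targets rather than the actual character model:

```python
import torch

n, vocab = 32, 27  # batch size and character count, as in the lecture
torch.manual_seed(0)
logits = torch.randn(n, vocab, requires_grad=True)
Yb = torch.randint(0, vocab, (n,))  # hypothetical target indices

log_probs = torch.log_softmax(logits, dim=1)
log_probs.retain_grad()  # keep the gradient on this intermediate tensor
loss = -log_probs[range(n), Yb].mean()  # cross-entropy via picked log-probs
loss.backward()

# Manual gradient: -1/n at the picked positions, zero elsewhere.
dlog_probs = torch.zeros_like(log_probs)
dlog_probs[range(n), Yb] = -1.0 / n

print(torch.allclose(dlog_probs, log_probs.grad))
```

`retain_grad` is needed because `log_probs` is an intermediate (non-leaf) tensor, whose gradient PyTorch would otherwise discard.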
EXERCISE 2: DERIVING GRADIENTS FOR LOGITS WITH CROSS-ENTROPY
Moving from the atom-by-atom backprop, the learner derives an analytic gradient of the loss with respect to the logits, bypassing the long chain of sub-operations. By focusing on a single example and using the softmax formula, the slide derives a compact expression for dL/dlogits: the gradient is the softmax probabilities, with a subtract-one offset at the correct class, all averaged across the batch. Implementing this yields a clean, efficient, and numerically stable backprop step, matching PyTorch up to tiny floating-point differences and providing a strong intuition for how cross-entropy drives learning.
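The compact expression can be checked directly against autograd. A minimal sketch, with hypothetical random logits and targets:

```python
import torch
import torch.nn.functional as F

n, vocab = 32, 27
torch.manual_seed(1)
logits = torch.randn(n, vocab, requires_grad=True)
Yb = torch.randint(0, vocab, (n,))

loss = F.cross_entropy(logits, Yb)  # reduction defaults to the batch mean
loss.backward()

# Analytic shortcut: softmax probabilities, minus one at the correct class,
# averaged over the batch.
dlogits = F.softmax(logits.detach(), dim=1)
dlogits[range(n), Yb] -= 1
dlogits /= n

print(torch.allclose(dlogits, logits.grad, atol=1e-6))
```

The agreement is only up to float32 round-off, since `F.cross_entropy` fuses log-softmax and the negative-log-likelihood internally.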
CROSS-ENTROPY INTUITION: FORCES OF PULL AND PULLBACK
The narrative deeply explores the intuition behind the cross-entropy gradient. In a row of logits, the gradient signals a pull away from incorrect options and a push toward the correct one, scaled by the softmax probabilities. The probability row sums to one, so the total gradient in each row sums to zero, preserving a balance of forces. This mental model — a network of “pulls” and “pushes” on logits — helps explain why confident mispredictions produce large gradient updates while near-correct predictions yield smaller adjustments. It also clarifies why the gradient is distributed across the 27 options in a row.
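A toy row makes the zero-sum balance concrete. Here a three-way classifier (standing in for the 27-way one) is confidently wrong, and the resulting gradient pulls hard on the correct class while pushing down the overconfident one:

```python
import torch
import torch.nn.functional as F

# One row of 3 logits; the correct class is index 0 (toy stand-in for 27 chars).
logits = torch.tensor([[-2.0, 3.0, 0.5]])  # confidently wrong: class 1 dominates
p = F.softmax(logits, dim=1)

dlogits = p.clone()
dlogits[0, 0] -= 1  # probability minus one at the correct class

print(dlogits)        # negative "pull" on the correct class, positive "push" elsewhere
print(dlogits.sum())  # each row sums to ~0: the forces balance
```

Because the probabilities in a row sum to one, subtracting one at the target leaves a row that sums to zero, exactly the balance of forces described above.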
EXERCISE 3: HAND-CRAFTING BATCH-NORM BACKWARD PASS
Batch normalization introduces additional complexity due to mean and variance, running statistics, and broadcasting. Here the backward pass focuses on the input gradient through the BN transform, while acknowledging that many tutorials leave BN's full chain complicated. The discussion covers the subtle Bessel's correction (bias correction) used when estimating variance, the role of epsilon for numerical stability, and the impact of training vs inference modes. The implementation derives a compact, correct expression for the BN backward pass, including how gradients must be accumulated across the batch and broadcast to each feature dimension.
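The compact expression from the lecture can be verified numerically. This sketch assumes hypothetical shapes (batch of 32, 64 hidden units) and uses the Bessel-corrected variance (dividing by n−1), which is why the n/(n−1) factor appears in the backward formula:

```python
import torch

torch.manual_seed(2)
n, d = 32, 64  # batch size and hidden width (hypothetical)
hprebn = torch.randn(n, d, requires_grad=True)
bngain = torch.randn(1, d)
bnbias = torch.randn(1, d)

# Forward: batch norm with Bessel-corrected variance, as in the lecture.
bnmean = hprebn.mean(0, keepdim=True)
bnvar = hprebn.var(0, keepdim=True, unbiased=True)  # divides by n-1
bnvar_inv = (bnvar + 1e-5) ** -0.5                  # epsilon for stability
bnraw = (hprebn - bnmean) * bnvar_inv
hpreact = bngain * bnraw + bnbias

# Pretend some downstream gradient arrived at the BN output.
dhpreact = torch.randn_like(hpreact)
hpreact.backward(dhpreact)

# Compact analytic backward through the BN transform.
dhprebn = bngain * bnvar_inv / n * (
    n * dhpreact
    - dhpreact.sum(0)
    - n / (n - 1) * bnraw * (dhpreact * bnraw).sum(0)
)

print(torch.allclose(dhprebn, hprebn.grad, atol=1e-5))
```

Note how the two `sum(0)` terms accumulate over the batch and then broadcast back across rows — exactly the accumulation-and-broadcast pattern the derivation has to get right.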
EXERCISE 4: INTEGRATING AND VALIDATING HAND-CRAFTED GRADIENTS
The final exercise integrates all hand-derived pieces: the backward through the entire network, the cross-entropy loss, the batch-norm, and the embedding/linear layers. The learner switches to a no_grad context and compares the hand-derived gradient updates against PyTorch’s autograd, iterating until the gradients converge to a tiny difference. The practitioner demonstrates that the same optimization dynamics can be achieved with hand-made gradients, reinforcing confidence in understanding and exposing the internal mechanics that PyTorch abstracts away.
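A small comparison helper, in the spirit of the lecture's `cmp` utility, makes this iterative validation routine. The toy usage below (gradient of a sum of squares) is an illustrative stand-in, not the lecture's network:

```python
import torch

def cmp(name, dt, t):
    """Compare a hand-derived gradient dt against t.grad from autograd."""
    exact = torch.all(dt == t.grad).item()          # bitwise-identical?
    approx = torch.allclose(dt, t.grad)             # equal up to float error?
    maxdiff = (dt - t.grad).abs().max().item()
    print(f'{name:15s} | exact: {exact} | approx: {approx} | maxdiff: {maxdiff:.2e}')
    return approx

# Toy usage: the gradient of (x**2).sum() with respect to x is 2x.
x = torch.randn(5, requires_grad=True)
(x ** 2).sum().backward()
ok = cmp('x', 2 * x, x)
```

Reporting exact, approximate, and maximum-difference columns separately is useful: analytic shortcuts (like the fused cross-entropy gradient) often match only approximately, while element-by-element backprop can match bitwise.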
HISTORICAL CONTEXT AND PRACTICAL TAKEAWAYS
A closing thread reflects on history and practical lessons: 10+ years ago, manual backward passes were ubiquitous; today, autograd dominates, but the exercise reveals why understanding gradients matters for debugging, correctness, and optimization. The talk foreshadows moving toward recurrent nets (LSTMs) and more complex architectures, building a solid foundation for understanding gradient flow across diverse layers. The key takeaway is not to shun autograd, but to internalize how gradients flow, how to diagnose subtle issues, and how to design numerically stable and efficient backward passes when needed.
Common Questions
Why derive the backward pass by hand when autograd exists? The instructor argues manual backprop is a valuable exercise: it reveals how gradients flow, helps debug subtle bugs (e.g., clipped losses or broadcasting errors), and builds intuition that autograd can hide. (timestamp: 38)
Topics
Mentioned in this video
The built-in automatic differentiation mechanism in PyTorch; the lecturer argues for replacing calls to loss.backward with a manual tensor-level backward pass for pedagogical and debugging reasons.
Analytically derived gradient of cross-entropy w.r.t. logits: p - 1_{y} (scaled by 1/N for batch).
The standard backward relationships for D = A @ B + C (dA = dD @ B.T, dB = A.T @ dD, dC = sum(dD)), derived by hand from small examples.
The activation function used in the MLP; its backward rule (1 - a^2) is derived and used in the manual backward pass.
Utility PyTorch functions used in gradient-checking and to initialize derivative tensors in manual backprop.
The mechanism for turning integer token indices into embedded vectors; backward involves scattering and summing gradients into the embedding matrix rows used.
Used as an implementation trick to route gradients when scattering back from max/indices operations.
Subtracting per-row max from logits before exponentiation to avoid overflow; its backward contribution is small/near-zero and discussed in detail.
2006 work by Hinton and collaborators was cited as an example of research where gradients and updates were explicitly implemented by hand.
Statistical correction discussed in the context of batch normalization variance estimation (unbiased estimator dividing by n-1 vs biased n).
A training method (used historically for RBMs) that was mentioned to illustrate earlier practices of computing/using gradients directly instead of autograd.
The instructor's 2014 paper on aligning image fragments and text fragments; used as an example where the backward pass was implemented manually.
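The linear-layer and tanh rules listed above can be verified in a few lines. Shapes here are hypothetical; the point is that the hand-derived expressions match autograd:

```python
import torch

torch.manual_seed(3)
A = torch.randn(4, 5, requires_grad=True)
B = torch.randn(5, 3, requires_grad=True)
C = torch.randn(3, requires_grad=True)  # bias, broadcast across rows

D = torch.tanh(A @ B + C)
dD_up = torch.randn_like(D)   # some upstream gradient
D.backward(dD_up)

# tanh backward: local derivative is (1 - a^2), where a = tanh(pre-activation).
dpre = (1 - D ** 2) * dD_up
# Linear backward rules: dA = dD @ B.T, dB = A.T @ dD, dC = dD.sum(0).
dA = dpre @ B.T
dB = A.T @ dpre
dC = dpre.sum(0)  # bias was broadcast forward, so its gradient sums over the batch

for manual, t in [(dA, A), (dB, B), (dC, C)]:
    print(torch.allclose(manual, t.grad, atol=1e-6))
```

The bias rule is the mirror image of forward broadcasting: replication on the way forward becomes summation on the way back.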
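The max-subtraction trick mentioned above is easy to demonstrate: softmax is invariant to shifting a row by a constant, but the naive form overflows for large logits. A minimal sketch with deliberately huge values:

```python
import torch

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])  # large enough to overflow exp()

# Naive softmax: exp(1000) is inf in float32, and inf/inf gives nan.
naive = torch.exp(logits) / torch.exp(logits).sum(1, keepdim=True)

# Stable softmax: subtract the per-row max before exponentiating.
shifted = torch.exp(logits - logits.max(1, keepdim=True).values)
stable = shifted / shifted.sum(1, keepdim=True)

print(naive)   # nan everywhere
print(stable)  # well-defined probabilities, unchanged by the shift
```

Because the shift cancels in the softmax ratio, it changes nothing mathematically; its backward contribution correspondingly nets out, which is why the lecture finds it small/near-zero.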