The spelled-out intro to neural networks and backpropagation: building micrograd

Andrej Karpathy
Science & Technology · 3 min read · 146 min video
Aug 16, 2022 | 3,277,648 views | 66,089 | 2,438

Key Moments

TL;DR

Build a toy autograd (micrograd) from scratch and see backprop in action.

Key Insights

1

Autograd computes gradients by traversing a computation graph with the chain rule.

2

Micrograd uses a simple Value class to wrap scalars and overloads operators to build graphs.

3

The forward pass computes outputs; the backward pass computes gradients w.r.t. inputs and weights by traversing the graph in reverse topological order.

4

Visualization and gradient checks (numerical gradients) help verify backprop correctness.

5

Real-world libraries (PyTorch) implement similar ideas with tensors and modules; micrograd illustrates concepts in ~150 lines.

6

Training a neural net reduces to assembling a small expression (neuron, layer, MLP), a loss, backward pass, then gradient descent; pitfalls include zeroing grads and learning rate choices.
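The gradient check mentioned in insight 4 can be sketched with central differences. This is an illustrative toy with names of our choosing, not code from the video:

```python
# Verify an analytic gradient against a numerical one (central differences).
# f, analytic_grad, and numeric_grad are hypothetical example names.

def f(x, y):
    # simple scalar test function
    return x * y + x ** 2

def analytic_grad(x, y):
    # by hand: df/dx = y + 2x, df/dy = x
    return y + 2 * x, x

def numeric_grad(f, x, y, h=1e-6):
    # central difference approximates each derivative to O(h^2)
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

ax, ay = analytic_grad(3.0, -2.0)
nx, ny = numeric_grad(f, 3.0, -2.0)
assert abs(ax - nx) < 1e-4 and abs(ay - ny) < 1e-4
```

The same recipe works against a micrograd backward pass: nudge one input by ±h, re-run the forward pass, and compare the slope to the stored `.grad`.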

INTRO TO MICROGRAD AND AUTOGRAD

Andrej Karpathy presents micrograd as a tiny autograd engine designed to reveal training mechanics under the hood. The aim is to build a scalar computation graph that can be forward-evaluated and then backpropagated to obtain gradients with respect to inputs and weights. The core idea—that modern libraries rest on automatic differentiation—gets distilled into an approachable, minimal implementation. The talk emphasizes intuition over tensors, and shows that roughly 150 lines of Python can capture the essential autograd behavior.

BUILDING BLOCKS: THE VALUE CLASS AND COMPUTATION GRAPHS

The foundation is the Value class, which wraps a single scalar value, its gradient, and pointers to the values that contributed to it. Operators like add and mul are overloaded so expressions naturally form a graph: each new Value records its children and the operation that produced it. Forward passes compute the final output, while backward passes propagate gradients by following the graph in reverse, enabling backpropagation through the chain rule. This section also introduces graph visualization to see how a simple expression unfolds.
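A minimal sketch of that Value idea, following micrograd's conventions (`data`, `grad`, `_prev`, `_op`) but condensed by us for illustration:

```python
class Value:
    """Wrap a scalar and remember which Values produced it, and how."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0           # d(output)/d(this value), filled by backprop
        self._prev = set(_children)  # pointers to the inputs of this node
        self._op = _op               # the operation that produced it

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), '*')

    def __repr__(self):
        return f"Value(data={self.data})"

a, b = Value(2.0), Value(-3.0)
c = a * b + 1.0   # builds a two-node graph: first '*', then '+'
print(c.data)     # -5.0
```

Because each operator returns a new `Value` that records its children, simply writing the expression builds the graph that backprop will later walk in reverse.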

BACKPROPAGATION AND THE CHAIN RULE IN ACTION

Backpropagation is implemented by applying the chain rule along a topologically sorted graph. The gradient at the output starts at 1 and flows backward, with each node applying its local derivative and distributing gradients to its inputs. Accumulation via plus-equals handles shared inputs that contribute through multiple paths. The discussion covers simple ops (addition with local derivative 1, multiplication with the other input), as well as nonlinear nodes like tanh, underscoring how local derivatives knit together into global gradients.
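These mechanics can be sketched in one self-contained toy: each op attaches a `_backward` closure holding its local derivative, and `backward()` replays them in reverse topological order. This follows micrograd's design but is our own condensation:

```python
import math

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None  # chain-rule step, set by each op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # local derivative of addition is 1; += accumulates
            # contributions from inputs shared across multiple paths
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            # local derivative of a*b w.r.t. a is b, and vice versa
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t ** 2) * out.grad  # d tanh/dx = 1 - tanh^2
        out._backward = _backward
        return out

    def backward(self):
        # topological sort: children before parents, then replay reversed
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0  # d(output)/d(output) seeds the recursion
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(-3.0)
d = (a * b + 1.0).tanh()
d.backward()
print(a.grad)  # equals b * (1 - tanh(a*b + 1)^2)
```

The `+=` in every closure is what makes shared inputs work: a node used twice receives the sum of both path gradients rather than whichever arrived last.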

FROM NEURONS TO MLPS: TRAINING WITH FORWARD, LOSS, BACKPROP, AND UPDATE

Micrograd scales up from scalars to neural-net building blocks: Neuron, Layer, and Multi-Layer Perceptron (MLP). Karpathy demonstrates a tiny binary classification dataset, defines a mean-squared-error loss, and walks through forward, backward, and gradient-descent updates. A subtle but instructive bug—failing to zero gradients between iterations—is highlighted, illustrating practical pitfalls in training loops. The example shows how even a small network with ~40 parameters can embody the full training pipeline.
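The loop's rhythm can be shown in isolation with a one-parameter linear model and a hand-derived gradient instead of the full micrograd MLP — a hedged sketch of the forward → zero_grad → backward → update cycle, with the zero-grad pitfall marked:

```python
# Toy training loop: fit y = w*x to data generated by y = 2x.
# Gradients are derived by hand here purely to keep the example small.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # targets follow y = 2x

w = 0.0
lr = 0.01

for step in range(200):
    grad = 0.0                      # zero_grad: skipping this accumulates
                                    # stale gradients and corrupts updates
    for x, y in zip(xs, ys):
        pred = w * x                # forward pass
        err = pred - y              # squared-error loss term: err**2
        grad += 2 * err * x         # backward: d(err^2)/dw for this sample
    w -= lr * grad                  # gradient-descent update
print(round(w, 3))
```

With a learning rate this small the parameter converges smoothly toward 2.0; raise `lr` past the stable range and the same loop diverges, which is the learning-rate pitfall the section describes.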

MICROGRAD VS PYTORCH: INSIGHTS, LIMITS, AND TAKEAWAYS

The talk juxtaposes micrograd with PyTorch: autograd, tensors, and backward passes are the same core ideas, but micrograd distills them into a transparent, educational 100–150 line toy. PyTorch extends these concepts to large-scale tensors and modules, enabling production-ready models. The final message is that neural networks are about expressing computations, losses, gradients, and iterative optimization; micrograd demonstrates these principles clearly, while PyTorch handles efficiency and scale.

micrograd & backprop cheat sheet

Practical takeaways from this episode

Do This

Do implement a small Value class that stores data, grad, op, and children so you can build a computation graph (start at 1160).
Do implement and attach a local backward() closure for each operation that multiplies the incoming gradient by the local derivative and accumulates into children (see 3071).
Do build a topological ordering of the computation graph and run backward() in reverse topological order (see 4161).
Do zero all parameter gradients before each backward pass (zero_grad) to avoid accumulating gradients across steps (see 7724).
Do run small forward/backward/update loops (forward → zero_grad → backward → update) and monitor loss and LR to tune stability (see 7039 & 7429).

Avoid This

Don't overwrite child gradients in backward — use += to accumulate contributions from multiple paths (see 4494).
Don't forget to wrap numeric constants as Value objects if your operator expects Value operands (see 5161).
Don't run backward without resetting gradients (you’ll accumulate old gradients and get incorrect updates — see 7724).
Don't choose an overly large learning rate without monitoring loss; large steps can destabilize or explode training (see 7429).

Common Questions

What is micrograd? A tiny scalar autograd engine that implements automatic differentiation and backpropagation so you can see how gradient computation works under the hood; it's intentionally simple and pedagogical rather than optimized for production (see explanation at 50).

