Key Moments

Let's reproduce GPT-2 (124M)

Andrej Karpathy
Science & Technology · 6 min read · 242 min video
Jun 9, 2024 · 1,029,195 views
TL;DR

Reproducing GPT-2 124M: load weights, rebuild from scratch, train with modern optimizations.

Key Insights

1

GPT-2 124M sits in OpenAI's GPT-2 miniseries (124M–1.5B); the 124M model has 12 Transformer blocks, 768 hidden channels, 1024 context length, and a 50,257-token vocabulary.

2

The workflow demonstrates loading OpenAI's released weights via Hugging Face, inspecting tensors (embeddings, position encodings), and then rebuilding a GPT-2-like model from scratch for training and evaluation.

3

A custom GPT-2 implementation is built to mirror the Hugging Face naming and structure, including a weight-tying scheme between the token embeddings and the final LM head to save parameters and improve consistency.

4

Training involves careful data pipelines, gradient accumulation, and distributed data parallel (DDP) across 8 GPUs, with attention to mixed precision (tf32, bf16), torch.compile, and Flash Attention to accelerate performance.

5

Hyperparameters follow GPT-3-inspired guidance (warmup, cosine decay, large-batch regimes, weight decay, gradient clipping) and data strategies (Tiny Shakespeare for debugging, then large-scale web-derived data like FineWeb EDU for real pretraining).

6

Dataset choices are central: Tiny Shakespeare is used for quick iteration; FineWeb EDU (a roughly 10B-token subset) is used to emulate large-scale training data, with benchmarks like HellaSwag for world-knowledge evaluation.

GPT-2 124M: CONTEXT AND ARCHITECTURE

The video centers on reproducing the GPT-2 124M model, the smallest in OpenAI's GPT-2 miniseries, which scales up to 1.5B parameters. The 124M variant uses 12 Transformer blocks, 768 hidden channels, and a 1024-token context window, with a vocabulary of 50,257 tokens. It is a decoder-only Transformer: there is no encoder in the stack and no cross-attention to a separate encoder sequence. The presenter notes how scaling laws are typically displayed by plotting model size on the x-axis against downstream metrics (translation, summarization, QA) on the y-axis, and points out a known discrepancy in the published parameter counts, an error later corrected in the GitHub repo. The architecture uses standard transformer blocks with layer normalization and residual connections, but GPT-2 makes subtle changes: the layer norms are moved to the input of each sub-block (pre-normalization), and an additional layer norm is added after the final self-attention block, which affects optimization and stability. As a pedagogical goal, this section frames the reproduction effort: start from the target (the 124M weights) and move toward a scratch-built implementation that can learn to outperform the original on a controlled dataset.
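The hyperparameters above can be collected in a small configuration object; this is a minimal sketch using nanoGPT-style field names (block_size, vocab_size, n_layer, n_head, n_embd), which are an assumption rather than something fixed by the summary itself:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # GPT-2 124M hyperparameters as described above.
    # Field names follow common nanoGPT-style conventions (assumed).
    block_size: int = 1024    # context length in tokens
    vocab_size: int = 50257   # BPE vocabulary size
    n_layer: int = 12         # number of transformer blocks
    n_head: int = 12          # attention heads per block
    n_embd: int = 768         # hidden channels

cfg = GPTConfig()
# With 12 heads over 768 channels, each head works in 768 // 12 = 64 dims.
head_dim = cfg.n_embd // cfg.n_head
```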

RECONSTRUCTING GPT-2: LOADING, INSPECTING, AND PORTING WEIGHTS

The speaker demonstrates loading the 124M weights from OpenAI's release via Hugging Face, then inspecting the raw state dictionary to understand the shapes and roles of the weights: the token embedding (wte, shape 50,257 × 768), the learned positional embeddings (wpe), and the rest of the transformer parameters. GPT-2's weights were originally released as TensorFlow checkpoints, but Hugging Face provides PyTorch-compatible access, which makes loading and experimentation easier. They highlight the token vocabulary and the observation that the positional embeddings, although fully trainable, tend to learn sinusoidal-like structure. The process includes printing weight keys, checking shapes, and verifying that sampling from the loaded model yields coherent text, thereby validating a successful transfer from the released weights to a PyTorch workflow.
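The key/shape inspection pattern looks roughly like the sketch below. In the video the state dict comes from Hugging Face's pretrained GPT-2 (e.g. `GPT2LMHeadModel.from_pretrained("gpt2").state_dict()`); here a toy module with the same kind of keys stands in for the checkpoint so the snippet is self-contained and needs no download:

```python
import torch.nn as nn

# Toy stand-in for the downloaded checkpoint. The real state dict would come
# from Hugging Face's pretrained GPT-2; only the embedding tables are mocked
# here, with the shapes described above.
toy = nn.ModuleDict({
    "wte": nn.Embedding(50257, 768),  # token embeddings
    "wpe": nn.Embedding(1024, 768),   # learned positional embeddings
})

sd = toy.state_dict()
for k, v in sd.items():
    # Print each parameter name with its shape, as done in the video.
    print(k, tuple(v.shape))
```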

FROM PRETRAINED TO TRAIN-FROM-SCRATCH: BUILDING A CUSTOM GPT-2

With the weights loaded as a target, the next step is to implement GPT-2 from scratch in PyTorch, first to reproduce it and then to train it anew. The skeleton uses a transformer container with a token embedding matrix, a positional embedding matrix, a stack of 12 transformer blocks (n_layer = 12), a final layer normalization, and an LM head projecting back to the vocabulary. The block design reflects GPT-2's distinctive normalization layout (normalization and residual pathways integrated inside the block) and emphasizes the distinction between attention (communication across tokens) and the MLP (per-token processing). A critical design decision is weight tying: the token embedding matrix is shared with the final LM head, removing roughly 38M parameters (the 50,257 × 768 matrix) and aligning input/output statistics for similar tokens. The implementation aims to preserve the architecture while exposing enough structure to swap between the OpenAI weights and a from-scratch initialization, enabling a controlled study of learning dynamics.
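The weight-tying scheme can be sketched as below; this is a minimal illustration with assumed module names (wte, lm_head), not the full GPT-2 implementation:

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Minimal sketch of input/output weight tying (assumed names)."""
    def __init__(self, vocab_size=50257, n_embd=768):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)           # token embeddings
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        # Tie: the output projection shares the embedding tensor, so the
        # vocab_size * n_embd matrix (~38.6M values) is stored only once.
        self.lm_head.weight = self.wte.weight

model = TinyLM()
# parameters() deduplicates shared tensors, so the tied matrix counts once.
n_params = sum(p.numel() for p in model.parameters())
```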

TRAINING PIPELINE: DATA, LOOPS, AND OPTIMIZATIONS

Training shifts from a small debugging dataset to large-scale pretraining, driven by a carefully engineered data pipeline and a suite of performance optimizations. Initially, Tiny Shakespeare serves as a debugging sandbox to validate the forward pass, loss computation, and gradient flow on CPU or GPU. The plan then scales to FineWeb EDU, a ~10B-token subset of high-quality educational data (via Hugging Face datasets), stored as shards to manage storage and streaming. The data loader assembles B × T sequences, computes target labels (the input offset by one token), and supports gradient accumulation to emulate batch sizes that exceed GPU memory. The training loop uses mixed precision (tf32, bf16) via PyTorch's autocast context manager for speed, with torch.compile providing kernel fusion to reduce Python overhead. They also discuss practical tricks: padding the vocabulary from 50,257 to 50,304 (a multiple of 128, replacing an "ugly number" with one that aligns with memory tiling), weight decay scheduling, and using fused AdamW for speed. They employ distributed data parallel (DDP) across 8 GPUs, with careful synchronization so gradients are averaged only at the end of the gradient-accumulation steps, plus a robust approach to logging and validating progress via a validation split. All told, the training pipeline illustrates how to scale a GPT-2-like model on modern hardware while maintaining reproducibility and experimental control.
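The batch construction described above can be sketched as follows; the function name and shapes are illustrative (the actual loader reads sharded token files rather than an in-memory tensor):

```python
import torch

def get_batch(tokens: torch.Tensor, B: int, T: int, pos: int):
    """From a long token stream, build a B x T input batch and its targets.

    Takes B*T+1 consecutive tokens: the first B*T are the inputs x, and the
    same window shifted by one token forms the next-token targets y.
    """
    buf = tokens[pos : pos + B * T + 1]
    x = buf[:-1].view(B, T)   # inputs
    y = buf[1:].view(B, T)    # targets, offset by one position
    return x, y

tokens = torch.arange(0, 1000)          # toy "dataset" of token ids
x, y = get_batch(tokens, B=4, T=8, pos=0)

# Gradient accumulation (conceptually): run grad_accum_steps micro-batches,
# scale each loss by 1/grad_accum_steps, call loss.backward() each time, and
# step the optimizer only after the last micro-batch.
```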

SCALING, HYPERPARAMETERS, AND EVALUATION: PRETRAINING STRATEGIES

The speaker assembles a set of GPT-3-inspired hyperparameters to guide pretraining: learning-rate schedules (cosine decay with warmup), gradient clipping, weight decay, and precision strategies. They implement warmup over hundreds of millions of tokens, then cosine decay to a low final rate, mirroring GPT-3 practice. Gradient clipping stabilizes training in the early, volatile steps. They discuss the transition from tf32 to bf16, with careful attention to which operations stay in higher precision (e.g., layer norms) and which tolerate reduced precision (matrix multiplies). They also cover Flash Attention, which avoids materializing the large attention matrix and reduces memory bandwidth, coupled with torch.compile to fuse kernels. Training is performed in a distributed fashion (DDP) with gradient synchronization only at the end of each micro-batch accumulation, and there is a detailed note on the complexity of getting learning-rate schedules, batch sizes, and weight decay exactly right in a multi-GPU setting. Overall, this section demonstrates a practical approach to scaling up GPT-2-like training while staying close to GPT-3-style hyperparameters and evaluation strategies.
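The warmup-plus-cosine-decay schedule can be written as a small function; the step counts and rates here are illustrative defaults, not the video's exact values:

```python
import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=10, max_steps=50):
    """GPT-3-style schedule sketch: linear warmup, then cosine decay to a floor."""
    if step < warmup_steps:
        # Linear warmup from ~0 up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        # After decay finishes, hold at the minimum rate.
        return min_lr
    # Cosine decay: coeff goes smoothly from 1 down to 0.
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return min_lr + coeff * (max_lr - min_lr)
```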

DATASETS, EVALUATION, AND FUTURE: FINEWEB EDU AND HELLASWAG

The long-term data strategy shifts from toy corpora to a substantial real-pretraining regime. The plan centers on FineWeb EDU, a curated 10B-token subset of high-quality, educationally oriented content drawn from filtered Common Crawl data. The dataset is prepared as shards to support streaming and efficient IO. Alongside training, the team outlines evaluation plans: periodic validation loss on a dedicated validation shard, plus extrinsic benchmarks like HellaSwag to test world knowledge and reasoning via sentence completion. HellaSwag offers an early signal, smooth progression, and an established baseline (published GPT-3 results across model scales) for comparison. The discussion also touches on practical data-pipeline concerns, reproducibility (commit history and an open-sourced codebase), and future steps to tighten calibration, improve initialization, and refine optimization strategies at larger model sizes.
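HellaSwag-style scoring as described above picks, for each context, the candidate ending whose tokens the model finds most likely, i.e. the ending with the lowest average per-token loss. A minimal sketch, with made-up loss values standing in for real model outputs:

```python
def pick_completion(per_token_losses):
    """Return the index of the ending with the lowest mean per-token loss.

    per_token_losses: one list of token losses per candidate ending
    (endings can have different lengths, hence the mean).
    """
    avg = [sum(losses) / len(losses) for losses in per_token_losses]
    return min(range(len(avg)), key=avg.__getitem__)

# Toy scores for the four endings of one HellaSwag item (illustrative only).
candidates = [
    [3.1, 2.8, 3.0],        # ending 0
    [1.2, 1.5, 1.1, 1.3],   # ending 1: lowest average loss
    [2.6, 2.9],             # ending 2
    [3.5, 3.2, 3.8],        # ending 3
]
choice = pick_completion(candidates)
```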

Common Questions

The video explains that you can reproduce GPT-2 (124M) in roughly an hour on modern cloud GPUs for around $10, depending on instance pricing; the estimate is given near the start. (timestamp: 161)
