Let's reproduce GPT-2 (124M)
Key Moments
Reproducing GPT-2 124M: load weights, rebuild from scratch, train with modern optimizations.
Key Insights
GPT-2 124M sits in OpenAI's GPT-2 miniseries (124M–1.5B); the 124M model has 12 Transformer blocks, 768 hidden channels, 1024 context length, and a 50,257-token vocabulary.
The workflow demonstrates loading OpenAI's released weights via Hugging Face, inspecting tensors (embeddings, position encodings), and then rebuilding a GPT-2-like model from scratch for training and evaluation.
A custom GPT-2 implementation is built to mirror the Hugging Face naming and structure, including a weight-tying scheme between the token embeddings and the final LM head that saves parameters and keeps input/output representations consistent.
Training involves careful data pipelines, gradient accumulation, and distributed data parallel (DDP) across 8 GPUs, with attention to mixed precision (TF32, BF16), torch.compile, and FlashAttention to accelerate performance.
Hyperparameters follow GPT-3-inspired guidance (warmup, cosine decay, large-batch regimes, weight decay, gradient clipping) and data strategies (Tiny Shakespeare for debugging, then large-scale web-derived data like FineWeb-Edu for real pretraining).
Dataset choices are central: Tiny Shakespeare is used for quick iteration; FineWeb-Edu (a roughly 10B-token sample) is used to emulate large-scale training data, with benchmarks like HellaSwag for evaluating world knowledge.
GPT-2 124M: CONTEXT AND ARCHITECTURE
The video centers on reproducing the GPT-2 124M model, the smallest member of OpenAI's GPT-2 miniseries, which scales up to 1.5B parameters. The 124M variant uses 12 Transformer blocks, 768 hidden channels, and a 1024-token context window, with a vocabulary of 50,257 tokens. It is a decoder-only Transformer: there is no encoder in the stack and no cross-attention to a separate encoder sequence. The presenter notes how scaling results are typically displayed by plotting model size on the x-axis against downstream metrics (translation, summarization, QA) on the y-axis, and points out a known discrepancy in published parameter counts due to an error later corrected in the GitHub repo. The architecture uses standard transformer blocks with layer normalization and residual connections, but GPT-2 moves the layer norms to the input of each sub-block and adds an additional layer norm after the final self-attention block, which affects optimization and stability. As a pedagogical framing, this section sets up the reproduction effort: start from the target (the released 124M weights) and move toward a scratch-built implementation that can learn to match or outperform the original on a controlled dataset.
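The sizes quoted above pin down the parameter count. A quick back-of-the-envelope check in plain Python (a sketch assuming the standard GPT-2 block layout of two layer norms, a fused QKV projection, an attention output projection, and a 4x MLP, with biases throughout and the LM head tied to the token embedding so it is counted once):

```python
# GPT-2 124M configuration as described above.
n_layer, n_head, n_embd = 12, 12, 768
block_size, vocab_size = 1024, 50257

wte = vocab_size * n_embd                  # token embedding (shared with LM head)
wpe = block_size * n_embd                  # learned positional embedding
per_block = (
    2 * n_embd                             # ln_1 (scale + bias)
    + n_embd * 3 * n_embd + 3 * n_embd     # fused QKV projection
    + n_embd * n_embd + n_embd             # attention output projection
    + 2 * n_embd                           # ln_2
    + n_embd * 4 * n_embd + 4 * n_embd     # MLP up-projection
    + 4 * n_embd * n_embd + n_embd         # MLP down-projection
)
ln_f = 2 * n_embd                          # final layer norm
total = wte + wpe + n_layer * per_block + ln_f
print(total)  # -> 124439808, i.e. the "124M" figure
```

The tied LM head is the reason the count lands near 124M rather than 163M: the 50257 x 768 output matrix reuses the embedding storage.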
RECONSTRUCTING GPT-2: LOADING, INSPECTING, AND PORTING WEIGHTS
The speaker demonstrates loading the 124M weights from OpenAI's release via Hugging Face, then inspecting the raw state dictionary to understand the shapes and roles of the weights (the token embedding WTE with shape 50,257 x 768, the positional embeddings WPE, and the rest of the transformer parameters). GPT-2's weights were originally released as TensorFlow checkpoints, but Hugging Face provides PyTorch-compatible access, which makes loading and experimentation easier. They highlight the token vocabulary and the observation that the positional embeddings, although trainable, learn a sinusoidal-like structure. The process includes printing weight keys, checking shapes, and verifying that sampling from the loaded model yields coherent text, thereby validating a successful transfer from released weights to a PyTorch workflow.
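In practice the loading goes through transformers' `GPT2LMHeadModel.from_pretrained("gpt2")` followed by `.state_dict()`. The shapes one should expect when printing keys can be tabulated ahead of time (a sketch of expected values, not output copied from the video; note that Hugging Face keeps the TF-style `Conv1D` convention, so linear-layer weights are transposed relative to a PyTorch `nn.Linear`):

```python
# Expected shapes for a handful of GPT-2 124M state-dict entries.
expected = {
    "transformer.wte.weight": (50257, 768),   # token embedding (WTE)
    "transformer.wpe.weight": (1024, 768),    # learned positional embedding
    # Conv1D stores (in_features, out_features), transposed vs nn.Linear:
    "transformer.h.0.attn.c_attn.weight": (768, 2304),  # fused Q, K, V
    "transformer.h.0.mlp.c_fc.weight": (768, 3072),     # MLP up-projection
    "lm_head.weight": (50257, 768),           # tied to wte.weight
}
for key, shape in expected.items():
    print(f"{key}: {shape}")
```

The transposed `Conv1D` layout is why a scratch reimplementation that uses `nn.Linear` must transpose a whitelist of weight matrices when porting the released checkpoint.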
FROM PRETRAINED TO TRAIN-FROM-SCRATCH: BUILDING A CUSTOM GPT-2
With the released weights as a target, the next step is to implement GPT-2 from scratch in PyTorch so it can be reproduced and then trained anew. The skeleton uses a transformer container module with a token embedding matrix, a positional embedding matrix, a stack of 12 transformer blocks (n_layer = 12), a final layer normalization, and an LM head projecting back to the vocabulary. The block design reflects GPT-2's pre-norm layout (normalization and residual pathways integrated inside the block) and emphasizes the distinction between attention (communication across tokens) and the MLP (per-token processing). A critical design decision is weight tying: the token embedding matrix is shared with the final LM head, removing roughly 38.6M parameters (about 30% of the total) and aligning input/output representations for similar tokens. The implementation preserves the architecture while exposing enough structure to swap between the OpenAI weights and a from-scratch initialization, enabling a controlled study of learning dynamics.
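The tying itself is one line in PyTorch: point the LM head's weight at the embedding's weight so both modules share a single tensor (toy sizes below for illustration; the real model ties a 50257 x 768 matrix, saving ~38.6M parameters):

```python
import torch
import torch.nn as nn

vocab, dim = 101, 16                       # toy sizes, not the real 50257 x 768
wte = nn.Embedding(vocab, dim)             # token embedding
lm_head = nn.Linear(dim, vocab, bias=False)
lm_head.weight = wte.weight                # weight tying: one shared tensor

# Both modules now reference the same storage; a gradient step through
# either path updates the other.
assert lm_head.weight.data_ptr() == wte.weight.data_ptr()

idx = torch.randint(0, vocab, (2, 5))      # (B, T) token ids
logits = lm_head(wte(idx))                 # (B, T, vocab)
print(logits.shape)
```

Because the shared tensor receives gradients from both the input and output ends, its initialization and scale matter more than for an untied embedding.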
TRAINING PIPELINE: DATA, LOOPS, AND OPTIMIZATIONS
Training shifts from a small debugging dataset to large-scale pretraining, driven by a carefully engineered data pipeline and a suite of performance optimizations. Initially, Tiny Shakespeare serves as a debugging sandbox to validate the forward pass, loss computation, and gradient flow on CPU or GPU. The plan then scales to FineWeb-Edu, a ~10B-token sample of high-quality educational data (via Hugging Face datasets), prepared as shards to manage storage and streaming. The data loader assembles B x T sequences, computes target labels (offset by one token), and supports gradient accumulation to emulate batch sizes that exceed GPU memory. The training loop uses mixed precision (TF32, BF16) via PyTorch's autocast context manager for speed, with torch.compile providing kernel fusion to reduce Python overhead. They also discuss practical tricks: padding the vocabulary from 50,257 to 50,304 (a multiple of 128, replacing an "ugly number" with a nice one) so tensor shapes align with kernel tiling, applying weight decay selectively to the matrix weights, and using the fused AdamW optimizer for speed. They then move to distributed data parallel (DDP) across 8 GPUs, with careful synchronization so gradients are averaged only at the end of gradient-accumulation steps, plus a robust approach to logging and validating progress via a validation split. All told, the training pipeline illustrates how to scale a GPT-2-like model on modern hardware while maintaining reproducibility and experimental control.
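Gradient accumulation, the core trick above, amounts to deferring `optimizer.step()` and scaling each micro-batch loss so the accumulated gradient equals the mean over the full effective batch. A minimal sketch with a toy model (the sizes and the `grad_accum_steps` value are illustrative, not the video's settings; betas and epsilon follow the GPT-3 guidance quoted in the video):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)              # stand-in for the GPT model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.95), eps=1e-8)

grad_accum_steps = 4                       # emulates a 4x larger batch
opt.zero_grad()
for micro_step in range(grad_accum_steps):
    x, y = torch.randn(2, 8), torch.randn(2, 1)   # one (B, T)-style micro-batch
    loss = F.mse_loss(model(x), y)
    # Divide so the summed gradients equal the mean over the effective batch.
    (loss / grad_accum_steps).backward()
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip, GPT-3 style
opt.step()
print(f"grad norm before clipping: {norm.item():.3f}")
```

Under DDP the same loop applies, with the extra care noted above: gradient all-reduce is suppressed on the inner micro-steps and allowed only on the final `backward()` of each accumulation cycle.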
SCALING, HYPERPARAMETERS, AND EVALUATION: PRETRAINING STRATEGIES
The speaker adopts a set of GPT-3-inspired hyperparameters to guide pretraining: learning-rate schedules (cosine decay with warmup), gradient clipping, weight decay, and precision strategies. Warmup runs over the first few hundred million tokens, then cosine decay brings the rate down to a low final value, mirroring GPT-3 practice. Gradient clipping (to a global norm of 1.0) stabilizes the early, volatile steps. They discuss the transition from TF32 to BF16, with attention to which operations stay in higher precision (e.g., layer norms) and which tolerate reduced precision (matrix multiplies). They also cover FlashAttention, which avoids materializing the large attention matrix and reduces memory-bandwidth pressure, coupled with torch.compile to fuse kernels. Training runs under distributed data parallel (DDP) with gradient synchronization only at the end of each micro-batch accumulation cycle, and they note the complexity of getting learning-rate schedules, batch sizes, and weight decay exactly right in a multi-GPU setting. Overall, this section demonstrates a practical approach to scaling up GPT-2-like training while staying close to GPT-3-style hyperparameters and evaluation strategies.
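The warmup-plus-cosine schedule described above can be written as a small function of the step index. The constants here mirror the GPT-3-style values discussed (a floor at 10% of the maximum rate) but the horizons are toy numbers for illustration, not the video's exact settings:

```python
import math

max_lr, min_lr = 6e-4, 6e-5      # floor at 10% of max, per GPT-3 practice
warmup_steps, max_steps = 10, 50 # toy horizon for illustration

def get_lr(step):
    # 1) linear warmup up to max_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) past the decay horizon, hold at the floor
    if step >= max_steps:
        return min_lr
    # 3) cosine decay from max_lr down to min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)

print(get_lr(0), get_lr(warmup_steps - 1), get_lr(max_steps))
```

Writing the schedule as an explicit function of the step (rather than a stateful scheduler object) makes it trivial to log and to resume from a checkpoint deterministically.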
DATASETS, EVALUATION, AND FUTURE: FINEWEB-EDU AND HELLASWAG
The long-term data strategy shifts from toy corpora to a substantial real-pretraining regime. The plan uses FineWeb-Edu, a curated 10B-token sample of high-quality, educationally oriented content filtered from Common Crawl. The dataset is prepared as shards to support streaming and efficient IO. Alongside training, the evaluation plan includes periodic validation loss on a dedicated validation shard, plus extrinsic benchmarks like HellaSwag to test world knowledge and reasoning in sentence-completion tasks. HellaSwag offers an early signal, smooth progression, and established baselines (published GPT-2 and GPT-3 scores) for comparison. The discussion also touches on practical data-pipeline concerns, reproducibility (commit history and an open-sourced codebase), and future steps to tighten calibration, improve initialization, and refine optimization strategies across larger model sizes.
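For a base model with no instruction tuning, HellaSwag-style evaluation is framed in the video as a likelihood comparison: score each of the four candidate endings by the model's average per-token loss over the completion region and predict the lowest. A pure-Python sketch over precomputed per-token losses (the numeric values are mock data; a real run would obtain them from the model's cross-entropy on each rendered ending):

```python
def pick_completion(per_option_token_losses):
    """Return the index of the ending with the lowest mean token loss."""
    means = [sum(losses) / len(losses) for losses in per_option_token_losses]
    return min(range(len(means)), key=means.__getitem__)

# Mock per-token losses for the 4 endings of one example (hypothetical values):
example = [
    [3.1, 2.9, 3.4],   # ending 0
    [1.2, 1.1, 0.9],   # ending 1 -- most likely under the model
    [2.5, 2.8, 2.6],   # ending 2
    [3.0, 3.3, 2.7],   # ending 3
]
print(pick_completion(example))  # -> 1
```

Averaging (rather than summing) the token losses keeps the comparison fair when the candidate endings have different lengths.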
Common Questions
The video explains that you can reproduce GPT-2 (124M) in roughly an hour on modern cloud GPUs, at a cost of around $10 depending on instance pricing; the estimate is given near the start. (timestamp: 161)
Topics
Mentioned in this video
The 124-million-parameter GPT-2 model variant released by OpenAI that this video reproduces and compares against.
GPT-3 paper used to borrow more concrete hyperparameter and optimization guidance not present in the GPT-2 paper.
Library used to load the PyTorch GPT-2 model and state dict as an easier source for the released weights.
OpenAI's tokenizer library used to encode text into GPT-2 token IDs for sampling and training.
Cloud provider referenced as the author's preferred place to rent GPU instances (A100, etc.).
GPU model used in the author's machine (discussed with precision and TF32/BF16 capabilities).
Small dataset used as a simple debugging dataset to validate batching, tokenization and overfitting.
A community reproduction attempt of the web-text dataset used historically for GPT-2 (referenced when discussing GPT-2 training data).
High-quality filtered subset of Common Crawl (fineweb-edu) used in the video as a 10B-token sample for pretraining experiments.
Mentioned as an example of an available cleaned dataset (RedPajama/slim) suitable for language-model training.
Large web-crawl data sources often used in data mixtures for LLM pretraining; discussed in the training-data section.
Optimizer chosen for training (AdamW variant) with recommended betas and epsilon following GPT-3 guidance.
The GELU nonlinearity (tanh-approximate form) used by GPT-2; the video explains exact vs. approximate forms and the historical reasons for the approximation.
Original Transformer paper referenced to explain positional encodings and encoder/decoder distinctions.
PyTorch autocasting context manager for mixed precision (bfloat16) training to reduce memory and increase speed.
Reference minimal GPT implementation (the video builds toward a similar minimal implementation).
A pure CUDA implementation referenced as a faster alternative to the PyTorch reference for GPT training.
Mentioned in context of historical discussion around approximate GELU and implementation tradeoffs.
Stanford paper and implementation, fused into PyTorch kernels, that speeds up attention by avoiding materializing the full attention matrix.
Earlier NVIDIA paper describing an online softmax normalization technique reused by FlashAttention.
PyTorch compiler used to fuse kernels and reduce Python overhead — large speedups discussed and demonstrated.