Let's reproduce GPT-2 (124M)
Key Moments
Reproducing GPT-2 124M: load weights, rebuild from scratch, train with modern optimizations.
Key Insights
GPT-2 124M sits in OpenAI's GPT-2 miniseries (124M–1.5B); the 124M model has 12 Transformer blocks, 768 hidden channels, 1024 context length, and a 50,257-token vocabulary.
The workflow demonstrates loading OpenAI's released weights via Hugging Face, inspecting tensors (embeddings, position encodings), and then rebuilding a GPT-2-like model from scratch for training and evaluation.
A custom GPT-2 implementation is built to mirror the Hugging Face naming and structure, including a weight-tying scheme between the token embeddings and the final LM head that saves parameters and keeps input/output representations consistent.
Training involves careful data pipelines, gradient accumulation, and distributed data parallel (DDP) across 8 GPUs, with attention to mixed precision (TF32, BF16), torch.compile, and FlashAttention to accelerate performance.
Hyperparameters follow GPT-3-inspired guidance (warmup, cosine decay, large-batch regimes, weight decay, gradient clipping) and data strategies (Tiny Shakespeare for debugging, then large-scale web-derived data like FineWeb-Edu for real pretraining).
Dataset choices are central: Tiny Shakespeare is used for quick iteration; FineWeb-Edu (a roughly 10B-token sample) is used to emulate large-scale training data, with benchmarks like HellaSwag for evaluating world knowledge.
GPT-2 124M: CONTEXT AND ARCHITECTURE
The video centers on reproducing the GPT-2 124M model, the smallest member of OpenAI's GPT-2 miniseries, which scales up to 1.5B parameters. The 124M variant uses 12 Transformer blocks, 768 hidden channels, and a 1024-token context window, with a vocabulary of 50,257 tokens. It is a decoder-only Transformer: there is no encoder in the stack and no cross-attention to a separate encoder sequence. The presenter notes how scaling results are typically displayed by plotting model size on the x-axis against downstream metrics (translation, summarization, QA) on the y-axis, and points out a known discrepancy in published parameter counts due to an error later corrected in the GitHub repo. The architecture uses standard transformer blocks with layer normalization and residual connections, but GPT-2 moves the layer norms to the input of each sub-block and adds an additional layer norm after the final self-attention block, which affects optimization and stability. As a pedagogical framing, this section sets up the reproduction effort: start from the target (the released 124M weights) and move toward a scratch-built implementation that can learn to match or outperform the original on a controlled dataset.
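The sizes quoted above pin down the parameter count. A quick back-of-the-envelope check in plain Python (a sketch assuming the standard GPT-2 block layout of two layer norms, a fused QKV projection, an attention output projection, and a 4x MLP, with biases throughout and the LM head tied to the token embedding so it is counted once):

```python
# GPT-2 124M configuration as described above.
n_layer, n_head, n_embd = 12, 12, 768
block_size, vocab_size = 1024, 50257

wte = vocab_size * n_embd                  # token embedding (shared with LM head)
wpe = block_size * n_embd                  # learned positional embedding
per_block = (
    2 * n_embd                             # ln_1 (scale + bias)
    + n_embd * 3 * n_embd + 3 * n_embd     # fused QKV projection
    + n_embd * n_embd + n_embd             # attention output projection
    + 2 * n_embd                           # ln_2
    + n_embd * 4 * n_embd + 4 * n_embd     # MLP up-projection
    + 4 * n_embd * n_embd + n_embd         # MLP down-projection
)
ln_f = 2 * n_embd                          # final layer norm
total = wte + wpe + n_layer * per_block + ln_f
print(total)  # -> 124439808, i.e. the "124M" figure
```

The tied LM head is the reason the count lands near 124M rather than 163M: the 50257 x 768 output matrix reuses the embedding storage.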
RECONSTRUCTING GPT-2: LOADING, INSPECTING, AND PORTING WEIGHTS
The speaker demonstrates loading the 124M weights from OpenAI's release via Hugging Face, then inspecting the raw state dictionary to understand the shapes and roles of the weights (the token embedding WTE with shape 50,257 x 768, the positional embeddings WPE, and the rest of the transformer parameters). GPT-2's weights were originally released as TensorFlow checkpoints, but Hugging Face provides PyTorch-compatible access, which makes loading and experimentation easier. They highlight the token vocabulary and the observation that the positional embeddings, although trainable, learn a sinusoidal-like structure. The process includes printing weight keys, checking shapes, and verifying that sampling from the loaded model yields coherent text, thereby validating a successful transfer from released weights to a PyTorch workflow.
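In practice the loading goes through transformers' `GPT2LMHeadModel.from_pretrained("gpt2")` followed by `.state_dict()`. The shapes one should expect when printing keys can be tabulated ahead of time (a sketch of expected values, not output copied from the video; note that Hugging Face keeps the TF-style `Conv1D` convention, so linear-layer weights are transposed relative to a PyTorch `nn.Linear`):

```python
# Expected shapes for a handful of GPT-2 124M state-dict entries.
expected = {
    "transformer.wte.weight": (50257, 768),   # token embedding (WTE)
    "transformer.wpe.weight": (1024, 768),    # learned positional embedding
    # Conv1D stores (in_features, out_features), transposed vs nn.Linear:
    "transformer.h.0.attn.c_attn.weight": (768, 2304),  # fused Q, K, V
    "transformer.h.0.mlp.c_fc.weight": (768, 3072),     # MLP up-projection
    "lm_head.weight": (50257, 768),           # tied to wte.weight
}
for key, shape in expected.items():
    print(f"{key}: {shape}")
```

The transposed `Conv1D` layout is why a scratch reimplementation that uses `nn.Linear` must transpose a whitelist of weight matrices when porting the released checkpoint.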
FROM PRETRAINED TO TRAIN-FROM-SCRATCH: BUILDING A CUSTOM GPT-2
With the released weights as a target, the next step is to implement GPT-2 from scratch in PyTorch so it can be reproduced and then trained anew. The skeleton uses a transformer container module with a token embedding matrix, a positional embedding matrix, a stack of 12 transformer blocks (n_layer = 12), a final layer normalization, and an LM head projecting back to the vocabulary. The block design reflects GPT-2's pre-norm layout (normalization and residual pathways integrated inside the block) and emphasizes the distinction between attention (communication across tokens) and the MLP (per-token processing). A critical design decision is weight tying: the token embedding matrix is shared with the final LM head, removing roughly 38.6M parameters (about 30% of the total) and aligning input/output representations for similar tokens. The implementation preserves the architecture while exposing enough structure to swap between the OpenAI weights and a from-scratch initialization, enabling a controlled study of learning dynamics.
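The tying itself is one line in PyTorch: point the LM head's weight at the embedding's weight so both modules share a single tensor (toy sizes below for illustration; the real model ties a 50257 x 768 matrix, saving ~38.6M parameters):

```python
import torch
import torch.nn as nn

vocab, dim = 101, 16                       # toy sizes, not the real 50257 x 768
wte = nn.Embedding(vocab, dim)             # token embedding
lm_head = nn.Linear(dim, vocab, bias=False)
lm_head.weight = wte.weight                # weight tying: one shared tensor

# Both modules now reference the same storage; a gradient step through
# either path updates the other.
assert lm_head.weight.data_ptr() == wte.weight.data_ptr()

idx = torch.randint(0, vocab, (2, 5))      # (B, T) token ids
logits = lm_head(wte(idx))                 # (B, T, vocab)
print(logits.shape)
```

Because the shared tensor receives gradients from both the input and output ends, its initialization and scale matter more than for an untied embedding.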
TRAINING PIPELINE: DATA, LOOPS, AND OPTIMIZATIONS
Training shifts from a small debugging dataset to large-scale pretraining, driven by a carefully engineered data pipeline and a suite of performance optimizations. Initially, Tiny Shakespeare serves as a debugging sandbox to validate the forward pass, loss computation, and gradient flow on CPU or GPU. The plan then scales to FineWeb-Edu, a ~10B-token sample of high-quality educational data (via Hugging Face datasets), prepared as shards to manage storage and streaming. The data loader assembles B x T sequences, computes target labels (offset by one token), and supports gradient accumulation to emulate batch sizes that exceed GPU memory. The training loop uses mixed precision (TF32, BF16) via PyTorch's autocast context manager for speed, with torch.compile providing kernel fusion to reduce Python overhead. They also discuss practical tricks: padding the vocabulary from 50,257 to 50,304 (a multiple of 128, replacing an "ugly number" with a nice one) so tensor shapes align with kernel tiling, applying weight decay selectively to the matrix weights, and using the fused AdamW optimizer for speed. They then move to distributed data parallel (DDP) across 8 GPUs, with careful synchronization so gradients are averaged only at the end of gradient-accumulation steps, plus a robust approach to logging and validating progress via a validation split. All told, the training pipeline illustrates how to scale a GPT-2-like model on modern hardware while maintaining reproducibility and experimental control.
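Gradient accumulation, the core trick above, amounts to deferring `optimizer.step()` and scaling each micro-batch loss so the accumulated gradient equals the mean over the full effective batch. A minimal sketch with a toy model (the sizes and the `grad_accum_steps` value are illustrative, not the video's settings; betas and epsilon follow the GPT-3 guidance quoted in the video):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)              # stand-in for the GPT model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.95), eps=1e-8)

grad_accum_steps = 4                       # emulates a 4x larger batch
opt.zero_grad()
for micro_step in range(grad_accum_steps):
    x, y = torch.randn(2, 8), torch.randn(2, 1)   # one (B, T)-style micro-batch
    loss = F.mse_loss(model(x), y)
    # Divide so the summed gradients equal the mean over the effective batch.
    (loss / grad_accum_steps).backward()
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip, GPT-3 style
opt.step()
print(f"grad norm before clipping: {norm.item():.3f}")
```

Under DDP the same loop applies, with the extra care noted above: gradient all-reduce is suppressed on the inner micro-steps and allowed only on the final `backward()` of each accumulation cycle.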
SCALING, HYPERPARAMETERS, AND EVALUATION: PRETRAINING STRATEGIES
The speaker adopts a set of GPT-3-inspired hyperparameters to guide pretraining: learning-rate schedules (cosine decay with warmup), gradient clipping, weight decay, and precision strategies. Warmup runs over the first few hundred million tokens, then cosine decay brings the rate down to a low final value, mirroring GPT-3 practice. Gradient clipping (to a global norm of 1.0) stabilizes the early, volatile steps. They discuss the transition from TF32 to BF16, with attention to which operations stay in higher precision (e.g., layer norms) and which tolerate reduced precision (matrix multiplies). They also cover FlashAttention, which avoids materializing the large attention matrix and reduces memory-bandwidth pressure, coupled with torch.compile to fuse kernels. Training runs under distributed data parallel (DDP) with gradient synchronization only at the end of each micro-batch accumulation cycle, and they note the complexity of getting learning-rate schedules, batch sizes, and weight decay exactly right in a multi-GPU setting. Overall, this section demonstrates a practical approach to scaling up GPT-2-like training while staying close to GPT-3-style hyperparameters and evaluation strategies.
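The warmup-plus-cosine schedule described above can be written as a small function of the step index. The constants here mirror the GPT-3-style values discussed (a floor at 10% of the maximum rate) but the horizons are toy numbers for illustration, not the video's exact settings:

```python
import math

max_lr, min_lr = 6e-4, 6e-5      # floor at 10% of max, per GPT-3 practice
warmup_steps, max_steps = 10, 50 # toy horizon for illustration

def get_lr(step):
    # 1) linear warmup up to max_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) past the decay horizon, hold at the floor
    if step >= max_steps:
        return min_lr
    # 3) cosine decay from max_lr down to min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)

print(get_lr(0), get_lr(warmup_steps - 1), get_lr(max_steps))
```

Writing the schedule as an explicit function of the step (rather than a stateful scheduler object) makes it trivial to log and to resume from a checkpoint deterministically.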
DATASETS, EVALUATION, AND FUTURE: FINEWEB-EDU AND HELLASWAG
The long-term data strategy shifts from toy corpora to a substantial real-pretraining regime. The plan uses FineWeb-Edu, a curated 10B-token sample of high-quality, educationally oriented content filtered from Common Crawl. The dataset is prepared as shards to support streaming and efficient IO. Alongside training, the evaluation plan includes periodic validation loss on a dedicated validation shard, plus extrinsic benchmarks like HellaSwag to test world knowledge and reasoning in sentence-completion tasks. HellaSwag offers an early signal, smooth progression, and established baselines (published GPT-2 and GPT-3 scores) for comparison. The discussion also touches on practical data-pipeline concerns, reproducibility (commit history and an open-sourced codebase), and future steps to tighten calibration, improve initialization, and refine optimization strategies across larger model sizes.
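For a base model with no instruction tuning, HellaSwag-style evaluation is framed in the video as a likelihood comparison: score each of the four candidate endings by the model's average per-token loss over the completion region and predict the lowest. A pure-Python sketch over precomputed per-token losses (the numeric values are mock data; a real run would obtain them from the model's cross-entropy on each rendered ending):

```python
def pick_completion(per_option_token_losses):
    """Return the index of the ending with the lowest mean token loss."""
    means = [sum(losses) / len(losses) for losses in per_option_token_losses]
    return min(range(len(means)), key=means.__getitem__)

# Mock per-token losses for the 4 endings of one example (hypothetical values):
example = [
    [3.1, 2.9, 3.4],   # ending 0
    [1.2, 1.1, 0.9],   # ending 1 -- most likely under the model
    [2.5, 2.8, 2.6],   # ending 2
    [3.0, 3.3, 2.7],   # ending 3
]
print(pick_completion(example))  # -> 1
```

Averaging (rather than summing) the token losses keeps the comparison fair when the candidate endings have different lengths.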
Common Questions
The video explains that you can reproduce GPT-2 (124M) in roughly an hour on modern cloud GPUs, at a cost of around $10 depending on instance pricing; the estimate is given near the start. (timestamp: 161)
Topics
Mentioned in this video
The 124-million-parameter GPT-2 model variant released by OpenAI that this video reproduces and compares against.
GPT-3 paper used to borrow more concrete hyperparameter and optimization guidance not present in the GPT-2 paper.
Library used to load the PyTorch GPT-2 model and state dict as an easier source for the released weights.
OpenAI's tokenizer library used to encode text into GPT-2 token IDs for sampling and training.
Cloud provider referenced as the author's preferred place to rent GPU instances (A100, etc.).
GPU model used in the author's machine (discussed with precision and TF32/BF16 capabilities).
Small dataset used as a simple debugging dataset to validate batching, tokenization and overfitting.
A community reproduction attempt of the web-text dataset used historically for GPT-2 (referenced when discussing GPT-2 training data).
High-quality filtered subset of Common Crawl (fineweb-edu) used in the video as a 10B-token sample for pretraining experiments.
Mentioned as an example of an available cleaned dataset (RedPajama/slim) suitable for language-model training.
Large web-crawl data sources often used in data mixtures for LLM pretraining; discussed in the training-data section.
Optimizer chosen for training (AdamW variant) with recommended betas and epsilon following GPT-3 guidance.
The GELU nonlinearity (tanh-approximate form) used by GPT-2; the video explains exact vs. approximate forms and the historical reasons for the approximation.
Original Transformer paper referenced to explain positional encodings and encoder/decoder distinctions.
PyTorch autocasting context manager for mixed precision (bfloat16) training to reduce memory and increase speed.
Reference minimal GPT implementation (the video builds toward a similar minimal implementation).
A pure CUDA implementation referenced as a faster alternative to the PyTorch reference for GPT training.
Mentioned in context of historical discussion around approximate GELU and implementation tradeoffs.
Stanford paper and implementation, fused into PyTorch kernels, that speeds up attention by avoiding materializing the full attention matrix.
Earlier NVIDIA paper describing an online softmax normalization technique reused by FlashAttention.
PyTorch compiler used to fuse kernels and reduce Python overhead — large speedups discussed and demonstrated.