Let's build GPT: from scratch, in code, spelled out.

Andrej Karpathy
Science & Technology · 117 min video
Jan 17, 2023

Key Moments

TL;DR

From scratch to GPT: building a Transformer, training on Shakespeare, and scaling.

Key Insights

1. GPT is a decoder‑only Transformer (Generative Pre‑trained Transformer) that learns to predict the next token in a sequence by modeling word (or character) dependencies with attention.

2. A tiny but powerful hands‑on dataset (tiny Shakespeare) demonstrates the core training steps: character‑level tokenization, a ~1 MB dataset, a 65‑token vocabulary, and a 90/10 train/validation split.

3. Self‑attention with queries, keys, and values plus a masked (lower‑triangular) attention matrix enables autoregressive generation; scaling scores by the square root of the head size stabilizes the softmax and improves optimization.

4. Progressing from a simple embedding‑based model to a full Transformer with multi‑head attention, residual connections, layer normalization, and dropout dramatically boosts performance and trainability.

5. Scaling the model (longer context, more layers, more heads, larger embeddings) yields strong empirical gains (e.g., validation loss dropping from ~2.4 to ~1.48) on Shakespeare text.

6. A practical nanoGPT workflow shows how to structure code (model.py, train.py), implement generation, and experiment with depth, heads, dropout, and layer norms.

7. Pre‑training on large corpora creates a general language model; alignment/fine‑tuning (e.g., RLHF) is a separate stage used to steer models toward helpful, safe assistant behavior.

INTRO TO GPT AND LANGUAGE MODELS

The talk begins by framing GPT as a language model that completes sequences of text. It emphasizes the probabilistic nature of outputs: the same prompt can yield different but plausible continuations. GPT is described as a decoder‑only Transformer—pre‑trained on vast data and then used to generate next tokens. A hands‑on example in the lecture uses a short prompt about AI to showcase how such models produce coherent, world‑changing prose.

TRANSFORMER FOUNDATIONS AND THE ATTENTION MECHANISM

The foundational idea is the Transformer architecture introduced in Attention Is All You Need (2017). The model uses attention to weigh relationships between tokens. The speaker explains the core components—queries, keys, values—and how the attention scores are computed, scaled, and masked to enforce autoregressive behavior. Residual connections and layer normalization are highlighted as critical for training stability, especially as networks deepen and context grows.

DATA PREP: SHAKESPEARE AND TOKENIZATION CHOICES

To make the concept concrete, a tiny Shakespeare dataset (~1 MB) is used. The data is treated at the character level, yielding a vocabulary of 65 characters and enabling straightforward encoding/decoding. The data is split 90/10 into train and validation sets, and training samples are formed by sampling short, overlapping blocks (e.g., block size 8) to create multiple training examples from each snippet, showcasing the need for efficient batching and context handling.
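The tokenizer and split above can be sketched in a few lines of Python. This is a minimal illustration on a toy string; in the lecture the same steps run over the full ~1 MB tiny Shakespeare file, where `sorted(set(text))` yields the 65-character vocabulary.

```python
# Character-level tokenizer and 90/10 train/val split, sketched on a toy
# string; the lecture applies the identical steps to tiny Shakespeare.
text = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

chars = sorted(set(text))                      # the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # char  -> integer id
itos = {i: ch for ch, i in stoi.items()}       # id    -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

data = encode(text)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]      # 90/10 split
```

Because encoding and decoding are exact inverses over the vocabulary, `decode(encode(text))` round-trips the input unchanged.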

FROM SCRATCH TO NANOGPT: IMPLEMENTATION OVERVIEW

The lecture transitions from a minimal embedding‑based baseline to a Transformer by introducing nanoGPT, a compact two‑file codebase (model.py and train.py). Early on, a simple token‑embedding table with a linear head is used; generation samples from the predicted distribution over the last token only. The narrative then introduces self‑attention, multi‑head setups, and the practical coding steps required to structure data, build the model, and run iterative improvements.
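A condensed sketch of that embedding‑based baseline (a bigram model, with illustrative names, not the lecture's code verbatim): the token‑embedding table itself serves as the logits for the next character, and generation repeatedly samples from the last position.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal bigram baseline: the embedding table doubles as next-token logits.
class BigramLM(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):                    # idx: (B, T) token ids
        return self.token_embedding(idx)       # logits: (B, T, vocab_size)

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits = self(idx)[:, -1, :]       # only the last time step
            probs = F.softmax(logits, dim=-1)
            nxt = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, nxt], dim=1) # append sampled token
        return idx
```

Sampling from only the last token is all a bigram model can use; the attention layers added later let predictions condition on the whole preceding context.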

TRAINING LOOP AND LOSS: GETTING A MODEL TO LEARN

The training loop uses batches of chunked data, computes cross‑entropy loss, and updates parameters with an optimizer (Adam). The talk demonstrates that initial losses are relatively high (around 4+, close to the ln(65) ≈ 4.17 expected from a uniform guess over the 65‑character vocabulary) and gradually decrease as the model learns. Early generation attempts yield random text, illustrating the progression from a toy model to something that begins to capture Shakespearean style as training proceeds.
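One step of that loop can be sketched as follows. Random tensors stand in for the Shakespeare batches here, and a bare embedding table stands in for the model; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

# One training step: forward, cross-entropy over flattened (B*T) positions,
# backward, optimizer update. Random data stands in for real batches.
vocab_size, batch_size, block_size = 65, 4, 8
model = torch.nn.Embedding(vocab_size, vocab_size)   # stand-in for the LM
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

xb = torch.randint(vocab_size, (batch_size, block_size))  # inputs
yb = torch.randint(vocab_size, (batch_size, block_size))  # next-token targets

logits = model(xb)                                   # (B, T, vocab_size)
loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

`F.cross_entropy` expects (N, C) logits, hence the `view` that merges the batch and time dimensions before computing the loss.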

SELF-ATTENTION DETAILS: QKV, MASKING, SCALE, AND HEADS

A core portion explains self‑attention in detail: each token produces a query, key, and value. The attention scores are computed as a dot product of queries with keys, masked to prevent looking at future tokens, and scaled by the square root of the head size to stabilize softmax. The resulting weights are used to combine values, enabling data‑dependent information flow between tokens. The section also covers single‑head vs multi‑head attention and why multiple heads improve expressivity.
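The mechanics of a single head can be written out directly. This is a sketch with illustrative sizes: scores are scaled by `head_size ** -0.5`, masked with a lower‑triangular matrix so position t only sees positions ≤ t, then softmaxed and used to mix the values.

```python
import torch
import torch.nn.functional as F

# One self-attention head: Q/K/V projections, scaled dot-product scores,
# causal (lower-triangular) masking, softmax, weighted sum of values.
torch.manual_seed(0)
B, T, C, head_size = 1, 8, 32, 16
x = torch.randn(B, T, C)                             # token representations

query = torch.nn.Linear(C, head_size, bias=False)
key   = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

q, k, v = query(x), key(x), value(x)                 # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5    # (B, T, T) scaled scores
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))      # block future tokens
wei = F.softmax(wei, dim=-1)                         # rows sum to 1
out = wei @ v                                        # (B, T, head_size)
```

After the mask and softmax, the first token's attention row places all of its weight on itself, and every row is a proper probability distribution over allowed positions.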

BUILDING A DECODER-ONLY TRANSFORMER AND ADDING DEPTH

The speaker builds a decoder‑only Transformer with residual connections and layer normalization. Dropout is introduced to regularize training as depth increases. The pre‑norm (layer norm before the block) variant is highlighted as beneficial for stability in deeper networks. This section emphasizes architectural choices that mirror the original GPT design while adapting them for an experimental, educational setting.
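A pre‑norm block of that kind might look like the sketch below. For brevity it uses PyTorch's built‑in `nn.MultiheadAttention` in place of the lecture's hand‑written heads, and the sizes are illustrative; the structural points it shows are the ones named above: LayerNorm applied before each sub‑layer, residual additions after, and dropout in the feed‑forward path.

```python
import torch
import torch.nn as nn

# Pre-norm decoder block: x + attn(ln(x)), then x + ffwd(ln(x)).
class Block(nn.Module):
    def __init__(self, n_embd=384, n_head=6, dropout=0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout,
                                          batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(                 # position-wise feed-forward
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        T = x.size(1)
        # True entries are disallowed: mask out strictly-future positions.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)                            # norm BEFORE the sub-layer
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                                  # residual around attention
        x = x + self.ffwd(self.ln2(x))             # residual around FFN
        return x
```

Because the residual path is an identity plus a learned correction, gradients flow directly through the sum, which is why stacking many such blocks remains trainable.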

SCALE, REGULARIZATION, AND REAL‑WORLD TRAINING RESULTS

As depth, heads, and embedding sizes increase, the model’s performance improves. The talk documents pushing a Shakespeare model to longer contexts (block size 256) and richer representations (embedding 384, 6 heads, 6 layers) with dropout. A validation loss of about 1.48 is reported after substantial training on an A100 GPU, demonstrating practical gains from scaling, regularization, and careful optimization.

PRE‑TRAINING VS FINE‑TUNING AND ALIGNMENT OVERVIEW

The session culminates with a high‑level map of how real GPT systems are built: (1) pre‑training on large, general corpora to learn broad language structure, and (2) fine‑tuning/alignment stages (e.g., supervised fine‑tuning, reward modeling, RLHF) to steer outputs toward helpful, safe assistant behavior. The speaker notes that the lecture focuses on pre‑training concepts and outlines the broad workflow without diving into proprietary alignment data or policies.

Minimal steps to build & train a tiny GPT (cheat sheet)

Practical takeaways from this episode

Do This

Do start with a small dataset (tiny Shakespeare) and a tiny character-level tokenizer to learn the pipeline quickly.
Do chunk data into fixed 'block_size' contexts and sample random batches (batch_size × block_size) for training.
Do use embeddings + positional embeddings, and implement scaled dot-product attention (queries, keys, values).
Do add residual (skip) connections and LayerNorm (pre-norm) to stabilize training of stacked blocks.
Do use Adam with a moderate learning rate and an evaluation routine that estimates train/val loss across multiple batches.
Do scale gradually: increase embedding size, number of heads / layers, and block size; add dropout if overfitting.
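The batching step in the list above can be sketched as a small sampler (names are illustrative): pick `batch_size` random offsets, take `block_size`‑length input chunks, and shift each chunk by one position to form the targets.

```python
import torch

# Sample random (input, target) chunks; targets are inputs shifted by one.
def get_batch(data, batch_size=4, block_size=8):
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

data = torch.arange(100)          # stand-in for the encoded Shakespeare text
xb, yb = get_batch(data)
```

With `torch.arange` as the stand-in data, every target is exactly the input plus one, which makes the shift-by-one relationship easy to verify.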

Avoid This

Don't try to train a large GPT-scale model on a CPU — use a GPU and smaller model config for feasibility.
Don't forget to mask future tokens in decoder/self-attention for autoregressive language modeling.
Don't omit LayerNorm / residuals for deeper stacks — optimization will become unstable without them.
Don't evaluate using a single noisy batch; average losses across many batches for a reliable signal.

Example model scaling and validation loss (from lecture)

Data extracted from this episode

Model config (embed / layers / heads / block)       | Training time (approx) | Validation loss (approx)
Small baseline (embed ~32, small blocks)            | minutes                | ≈ 2.5
Intermediate (single attention head + FFN)          | minutes                | ≈ 2.06
Scaled (embed=384, n_layer=6, n_head=6, block=256)  | ≈ 15 minutes on an A100 | ≈ 1.48

Common Questions

Q: What dataset is the model trained on?
A: The lecture trains a character-level Transformer on the tiny Shakespeare dataset (a ~1 MB concatenation of Shakespeare's works). See the dataset introduction at 231 seconds.
