Deep Dive into LLMs like ChatGPT

Andrej Karpathy
Science & Technology · 5 min read · 212 min video
Feb 5, 2025 | 5,797,389 views | 106,180 | 3,646

Key Moments

TL;DR

Data-driven training, tokens, post-training, and RLHF shape ChatGPT-like models; beware hallucinations.

Key Insights

1. Pre-training uses vast, filtered internet text (e.g., FineWeb, built on Common Crawl) to create a base model that behaves like an internet document simulator.

2. Text is converted into a fixed vocabulary of tokens (roughly 100k symbols); tokenization (and its quirks) is central to how models read and generate text.

3. A base model becomes an assistant through post-training: supervised fine-tuning, instruction tuning, and reinforcement learning from human feedback (RLHF).

4. Generation is probabilistic: inference samples next tokens from a distribution within a context window; tools like web search and code interpreters augment capabilities.

5. Hallucinations are common; mitigations include retrieval, uncertainty signaling, system prompts, and explicit refusals; reward models can help but have limitations.

6. Reinforcement learning in practice favors indirect human feedback (RLHF) over pure RL; the approach scales but is prone to reward gaming and alignment challenges.

PRE-TRAINING PIPELINE AND DATA SOURCES

The journey begins with pre-training, where a model is exposed to enormous volumes of internet text to learn the statistical patterns of language. Data pipelines combine sources like Common Crawl and curated sets such as FineWeb, with rigorous filtering: URL/domain filtering to remove malware and spam, text extraction to strip HTML, language filtering to emphasize English or multilingual coverage, and deduplication plus PII removal. The aim is massive, diverse, high-quality text, since the model's "knowledge", stored in its weights, is a compressed reflection of the internet. In practice, this yields datasets on the order of tens of terabytes and trillions of tokens (e.g., roughly 15 trillion tokens in some datasets), which become the training material for next-token prediction. The output of this stage is a base model: a powerful internet document simulator that can generate plausible text but is not yet an assistant capable of coherent, multi-turn dialogue without further tuning.
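The filtering steps above can be sketched as a tiny pipeline. This is an illustrative toy: the `clean_corpus` function and the blocklist are hypothetical names, and real pipelines like FineWeb use far more sophisticated heuristics (fuzzy deduplication, trained language classifiers, quality scoring).

```python
# Toy sketch of pre-training data cleaning: domain filtering,
# dropping near-empty extractions, and exact deduplication.
BLOCKED_DOMAINS = {"malware.example", "spam.example"}  # hypothetical blocklist

def clean_corpus(docs):
    """docs: iterable of (url, text) pairs -> list of deduplicated texts."""
    seen_hashes = set()
    kept = []
    for url, text in docs:
        domain = url.split("/")[2]
        if domain in BLOCKED_DOMAINS:      # URL/domain filtering
            continue
        if len(text.split()) < 5:          # drop near-empty extractions
            continue
        h = hash(text)                     # crude stand-in for fuzzy dedup
        if h in seen_hashes:               # deduplication
            continue
        seen_hashes.add(h)
        kept.append(text)
    return kept

docs = [
    ("https://good.example/a", "A long enough paragraph of useful English text here."),
    ("https://spam.example/x", "buy now buy now buy now buy now buy now"),
    ("https://good.example/b", "A long enough paragraph of useful English text here."),
]
print(clean_corpus(docs))  # only the first document survives
```

The blocked domain and the duplicate are both removed; only one clean document remains.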

TOKENIZATION AND REPRESENTATION

To feed text to neural networks, we must convert it into a finite set of symbols called tokens. Text is turned into a one-dimensional sequence of tokens via encoding schemes: UTF-8 bytes are grouped and then merged into subword symbols. A key trade-off is vocabulary size versus sequence length: a larger vocabulary yields shorter sequences. In practice, companies use byte-pair encoding (BPE) style methods, byte-level encoding plus iterative pair merging, to build vocabularies of roughly 100,000 tokens; GPT-4, for example, uses 100,277. Tokenization is not just a technical footnote: it determines how efficiently the model can represent text and how it handles rare words, capitalization, and spacing. Tokenized forms are the real inputs for both training and inference, and tokenization quirks (like how "Hello World" may split into different tokens depending on spacing and case) illustrate how sensitive generation is to token boundaries.
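The pair-merging idea behind BPE can be sketched in a few lines. This toy (the `most_common_pair` and `merge` helpers are hypothetical names) performs a single merge over raw UTF-8 bytes; production tokenizers repeat the merge step until the vocabulary reaches roughly 100k symbols.

```python
from collections import Counter

def most_common_pair(ids):
    """Count adjacent token pairs and return the most frequent one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes (vocabulary of 256) and perform one merge.
ids = list("aaabdaaabac".encode("utf-8"))
pair = most_common_pair(ids)   # the byte pair ('a', 'a')
ids = merge(ids, pair, 256)    # new token id 256 stands for that pair
print(len(ids))                # 11 bytes became 9 tokens
```

Each merge shortens sequences while growing the vocabulary by one symbol, which is exactly the vocabulary-size versus sequence-length trade-off described above.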

FROM BASE MODEL TO ASSISTANT: POST-TRAINING AND INSTRUCTION TUNING

A base model trained on internet text is not yet a useful assistant. The industry then applies post-training steps to turn the base into a deployable assistant: supervised fine-tuning (SFT), instruction tuning, and reinforcement learning with human feedback (RLHF). SFT uses human-generated prompts and ideal assistant responses to nudge the model toward helpful, honest, and harmless behavior. InstructGPT popularized this approach, and datasets like UltraChat and related mixtures increasingly fuse real and synthetic prompts to cover diverse tasks. RLHF introduces a reward model trained to predict human judgments, enabling policy optimization beyond static imitation. The post-training phase is less compute-heavy than pre-training but still expensive and data-intensive; it also introduces safety and alignment considerations through explicit labeling instructions and system prompts that shape the model’s persona and responses.
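One common formulation of the reward-model training objective mentioned above is a pairwise (Bradley-Terry style) loss over human preference pairs. This is a minimal sketch of that idea, not necessarily the exact loss any particular lab uses.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).
    Lower when the reward model scores the human-preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the margin in favor of the preferred answer grows.
print(round(preference_loss(2.0, 0.0), 4))  # small loss: ordering is right
print(round(preference_loss(0.0, 2.0), 4))  # large loss: ordering is wrong
```

Training the reward model means minimizing this loss over many human-labeled (chosen, rejected) response pairs; the resulting scorer then replaces direct human judgment during policy optimization.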

INFERENCE, CONTEXT, AND TOOL USE

During inference, the model generates text token by token, sampling from a probability distribution over a large vocabulary conditioned on an ever-growing context window. Modern models manage long contexts (thousands of tokens, sometimes more with optimizations) and balance speed with quality. Importantly, generation is a form of autocomplete, not a perfect oracle. To stay current and capable, these models increasingly use tools: web search to fetch fresh data, Python code interpreters to compute or verify results, and external APIs to perform tasks (e.g., calendar lookups, data fetches). Prompts and system messages shape behavior, including few-shot demonstrations and role-play orchestrations to elicit desired assistant behavior. The idea of in-context learning—solving tasks by example within the prompt—also plays a central role in steering the model’s output without explicit re-training.
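Token-by-token sampling with a temperature knob can be sketched as follows. This is a toy over a short logit list; real inference operates on vocabularies of ~100k entries with optimized kernels, but the sampling logic is the same.

```python
import math, random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample a token index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# Low temperature concentrates probability mass on the highest logit.
random.seed(0)
logits = [2.0, 1.0, 0.1]
picks = [sample_next_token(logits, temperature=0.1) for _ in range(100)]
print(picks.count(0))  # almost always token 0 at low temperature
```

Raising the temperature flattens the distribution and makes generation more varied; lowering it approaches greedy decoding, which is why sampled outputs are "autocomplete from a distribution" rather than a single deterministic answer.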

HALLUCINATIONS, FACTUALITY, AND MITIGATION

A persistent challenge is hallucination: the model may invent facts or make confident-sounding but false statements, because it is a statistical token predictor, not a fact-checking engine. This fallibility shows up as verbatim recall of heavily represented sources (e.g., Wikipedia) alongside fabricated, plausible-sounding details about obscure people. Mitigations include explicit refusals when uncertain, retrieval-augmented generation (e.g., web search) to refresh the context, tool use to verify information, and source citations. Advanced strategies evaluate factuality through internal checks or external judges and integrate uncertainty signals. Even with RLHF, the system remains probabilistic, so users must validate critical outputs. The model's self-identity and knowledge cutoff are also nuanced issues; system prompts or hardcoded identity text can influence how it presents itself in conversation.
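The retrieval-augmented pattern can be sketched as prompt assembly: fetch passages, paste them into the context, and instruct the model to cite sources or refuse. This is an illustrative sketch; `build_rag_prompt` is a hypothetical helper, and the `retrieve` callable stands in for a real web-search or vector-store call.

```python
def build_rag_prompt(question, retrieve):
    """Sketch of retrieval-augmented generation: fetch passages and paste
    them into the context so the model answers from fresh text instead of
    relying on its parametric memory."""
    passages = retrieve(question)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the sources below; say 'I don't know' if they "
        "do not contain the answer, and cite sources like [1].\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

# Toy retriever returning a canned passage.
fake_retrieve = lambda q: ["Paris is the capital of France."]
prompt = build_rag_prompt("What is the capital of France?", fake_retrieve)
print(prompt)
```

The instruction to refuse when the sources are silent is one concrete form of the "explicit refusals when uncertain" mitigation described above.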

REINFORCEMENT LEARNING: RLHF, REWARD MODELS, LIMITATIONS, AND FUTURE DIRECTIONS

Reinforcement learning in language models often takes the form of RLHF: a reward model learns to score outputs based on human preferences, and the policy is then optimized against those scores. Humans rank or rate multiple candidate responses, and a separate reward model learns to imitate these judgments. This indirection dramatically reduces the human labeling burden and enables optimization in domains where direct evaluation is difficult (summarization, creative writing, etc.). However, RLHF is not reinforcement learning against a ground-truth reward: reward models can be gamed, adversarial inputs can exploit their weaknesses, and improvements can plateau or degrade if optimization is pushed too aggressively. Researchers also observe that RL training can reveal or amplify the model's chain-of-thought patterns, sometimes producing lengthy, transparent reasoning traces that are not always reliable. The frontier includes longer-horizon, multi-task reasoning and more robust evaluation protocols to reduce gaming and align outputs with human intent.
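The policy-update idea can be sketched with REINFORCE on a toy two-response "policy": responses the reward model scores highly get their probability pushed up. This is a deliberately simplified stand-in for the PPO-style methods labs actually use, with hand-picked rewards standing in for a learned reward model.

```python
import math, random

def softmax(xs):
    """Convert logits to a probability distribution."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

random.seed(1)
logits = [0.0, 0.0]      # toy policy over two candidate responses
reward = [1.0, 0.0]      # reward-model scores (stand-in for human prefs)
lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    a = 0 if random.random() < probs[0] else 1   # sample a response
    # REINFORCE: d/d_logit_i of log pi(a) is (1[i == a] - probs[i]);
    # scale the update by the reward of the sampled response.
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * reward[a] * grad
print(softmax(logits)[0])  # close to 1: policy collapses onto the rewarded response
```

Note the failure mode this toy makes visible: the policy optimizes whatever the reward scores, so if the reward model is wrong or gameable, the policy will happily converge on the wrong behavior, which is exactly the gaming risk discussed above.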

Common Questions

The pre-training stage: download large filtered web corpora (e.g., Common Crawl / FineWeb), preprocess and tokenize the text, then train a model to predict next tokens. (see 63s)

Topics

Mentioned in this video

[tool] Byte Pair Encoding (BPE)

Tokenization algorithm (merging common byte/byte-pair sequences) used to build large vocabularies (~100k tokens).

[tool] Transformer

Neural network architecture visualized and explained as the core model type for LLMs.

[tool] NVIDIA H100

High-performance GPU hardware used to train large language models; price and cluster use are discussed.

[tool] Hyperbolic

Hosted demo service used in the video to interact with the LLaMA 3.1 405B base model.

[tool] OpenAssistant

Community effort to reproduce SFT datasets similar to InstructGPT (an example of open-source conversation datasets).

[tool] Falcon 7B

Example open model used to demonstrate hallucination behavior (an older model prone to fabrications).

[tool] Python / code interpreter

Tool used by models to compute reliably (e.g., counting, math) by writing code and running it externally.

[tool] LM Studio

Local application used to run distilled / smaller LLMs locally (demo shown).

[tool] FineWeb

Dataset curated by Hugging Face used as an example pretraining corpus (filtered, ~44 TB).

[tool] tiktoken (tokenizer demo)

Website/tool used to inspect tokenization (demonstrated with a GPT-style tokenizer).

[tool] LLaMA 3.1 (405B)

Meta's released base model (405B parameters), cited as an example of a large, openly released base model.

[study] InstructGPT (2022 paper)

OpenAI paper introducing supervised fine-tuning approaches (human labelers, labeling instructions) for making assistants.

[tool] Web search / Bing / Google

Tool integration pattern: the model emits special tokens to call web search and paste results into the context window.

[tool] Together.ai (playground)

Inference provider used to host open-weights models like DeepSeek R1 for public interaction.

[book] Pride and Prejudice

Example book used to explain the difference between model recollection and providing the exact chapter text in context.

[tool] AlphaGo
[tool] GPT-2
[tool] DeepSeek R1

