Deep Dive into LLMs like ChatGPT

Andrej Karpathy
Science & Technology · 5 min read · 212 min video
Feb 5, 2025 | 5,797,389 views | 106,180 | 3,646

Key Moments

TL;DR

Data-driven training, tokens, post-training, and RLHF shape ChatGPT-like models; beware hallucinations.

Key Insights

1. Pre-training uses vast, filtered internet text (e.g., FineWeb, built on Common Crawl) to create a base model that behaves like an internet document simulator.

2. Text is converted into a fixed vocabulary of tokens (roughly 100k symbols); tokenization (and its quirks) is central to how models read and generate text.

3. A base model becomes an assistant through post-training: supervised fine-tuning, instruction tuning, and reinforcement learning from human feedback (RLHF).

4. Generation is probabilistic: inference samples next tokens from a distribution within a context window; tools like web search and code interpreters augment capabilities.

5. Hallucinations are common; mitigations include retrieval, uncertainty signaling, system prompts, and explicit refusals; reward models can help but have limitations.

6. Reinforcement learning in practice favors indirect human feedback (RLHF) over pure RL; the approach scales but is prone to reward gaming and alignment challenges.

PRE-TRAINING PIPELINE AND DATA SOURCES

The journey begins with pre-training, where a model is exposed to enormous volumes of internet text to learn the statistical patterns of language. Data pipelines combine sources like Common Crawl and curated sets such as FineWeb, with rigorous filtering: URL/domain filtering to remove malware and spam, text extraction to strip HTML, language filtering to emphasize English or multilingual coverage, and deduplication plus PII removal. The aim is massive, diverse, high-quality text, since the model's "knowledge", stored in its weights, is a compressed reflection of the internet. In practice, this yields datasets on the order of tens of terabytes and trillions of tokens (e.g., roughly 15 trillion tokens in some datasets), which become the training material for next-token prediction. The output of this stage is a base model: a powerful internet document simulator that can generate plausible text but is not yet an assistant capable of coherent, multi-turn dialogue without further tuning.
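The filtering steps above can be sketched as a tiny pipeline. This is an illustrative toy: the `clean_corpus` function and the blocklist are hypothetical names, and real pipelines like FineWeb use far more sophisticated heuristics (fuzzy deduplication, trained language classifiers, quality scoring).

```python
# Toy sketch of pre-training data cleaning: domain filtering,
# dropping near-empty extractions, and exact deduplication.
BLOCKED_DOMAINS = {"malware.example", "spam.example"}  # hypothetical blocklist

def clean_corpus(docs):
    """docs: iterable of (url, text) pairs -> list of deduplicated texts."""
    seen_hashes = set()
    kept = []
    for url, text in docs:
        domain = url.split("/")[2]
        if domain in BLOCKED_DOMAINS:      # URL/domain filtering
            continue
        if len(text.split()) < 5:          # drop near-empty extractions
            continue
        h = hash(text)                     # crude stand-in for fuzzy dedup
        if h in seen_hashes:               # deduplication
            continue
        seen_hashes.add(h)
        kept.append(text)
    return kept

docs = [
    ("https://good.example/a", "A long enough paragraph of useful English text here."),
    ("https://spam.example/x", "buy now buy now buy now buy now buy now"),
    ("https://good.example/b", "A long enough paragraph of useful English text here."),
]
print(clean_corpus(docs))  # only the first document survives
```

The blocked domain and the duplicate are both removed; only one clean document remains.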

TOKENIZATION AND REPRESENTATION

To feed text to neural networks, we must convert it into a finite set of symbols called tokens. Text is turned into a one-dimensional sequence of tokens via encoding schemes: UTF-8 bytes are grouped and then merged into subword symbols. A key trade-off is vocabulary size versus sequence length: a larger vocabulary yields shorter sequences. In practice, companies use byte-pair encoding (BPE) style methods, byte-level encoding plus iterative pair merging, to build vocabularies of roughly 100,000 tokens; GPT-4, for example, uses 100,277. Tokenization is not just a technical footnote: it determines how efficiently the model can represent text and how it handles rare words, capitalization, and spacing. Tokenized forms are the real inputs for both training and inference, and tokenization quirks (like how "Hello World" may split into different tokens depending on spacing and case) illustrate how sensitive generation is to token boundaries.
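The pair-merging idea behind BPE can be sketched in a few lines. This toy (the `most_common_pair` and `merge` helpers are hypothetical names) performs a single merge over raw UTF-8 bytes; production tokenizers repeat the merge step until the vocabulary reaches roughly 100k symbols.

```python
from collections import Counter

def most_common_pair(ids):
    """Count adjacent token pairs and return the most frequent one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes (vocabulary of 256) and perform one merge.
ids = list("aaabdaaabac".encode("utf-8"))
pair = most_common_pair(ids)   # the byte pair ('a', 'a')
ids = merge(ids, pair, 256)    # new token id 256 stands for that pair
print(len(ids))                # 11 bytes became 9 tokens
```

Each merge shortens sequences while growing the vocabulary by one symbol, which is exactly the vocabulary-size versus sequence-length trade-off described above.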

FROM BASE MODEL TO ASSISTANT: POST-TRAINING AND INSTRUCTION TUNING

A base model trained on internet text is not yet a useful assistant. The industry then applies post-training steps to turn the base into a deployable assistant: supervised fine-tuning (SFT), instruction tuning, and reinforcement learning with human feedback (RLHF). SFT uses human-generated prompts and ideal assistant responses to nudge the model toward helpful, honest, and harmless behavior. InstructGPT popularized this approach, and datasets like UltraChat and related mixtures increasingly fuse real and synthetic prompts to cover diverse tasks. RLHF introduces a reward model trained to predict human judgments, enabling policy optimization beyond static imitation. The post-training phase is less compute-heavy than pre-training but still expensive and data-intensive; it also introduces safety and alignment considerations through explicit labeling instructions and system prompts that shape the model’s persona and responses.
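One common formulation of the reward-model training objective mentioned above is a pairwise (Bradley-Terry style) loss over human preference pairs. This is a minimal sketch of that idea, not necessarily the exact loss any particular lab uses.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).
    Lower when the reward model scores the human-preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the margin in favor of the preferred answer grows.
print(round(preference_loss(2.0, 0.0), 4))  # small loss: ordering is right
print(round(preference_loss(0.0, 2.0), 4))  # large loss: ordering is wrong
```

Training the reward model means minimizing this loss over many human-labeled (chosen, rejected) response pairs; the resulting scorer then replaces direct human judgment during policy optimization.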

INFERENCE, CONTEXT, AND TOOL USE

During inference, the model generates text token by token, sampling from a probability distribution over a large vocabulary conditioned on an ever-growing context window. Modern models manage long contexts (thousands of tokens, sometimes more with optimizations) and balance speed with quality. Importantly, generation is a form of autocomplete, not a perfect oracle. To stay current and capable, these models increasingly use tools: web search to fetch fresh data, Python code interpreters to compute or verify results, and external APIs to perform tasks (e.g., calendar lookups, data fetches). Prompts and system messages shape behavior, including few-shot demonstrations and role-play orchestrations to elicit desired assistant behavior. The idea of in-context learning—solving tasks by example within the prompt—also plays a central role in steering the model’s output without explicit re-training.
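Token-by-token sampling with a temperature knob can be sketched as follows. This is a toy over a short logit list; real inference operates on vocabularies of ~100k entries with optimized kernels, but the sampling logic is the same.

```python
import math, random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample a token index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# Low temperature concentrates probability mass on the highest logit.
random.seed(0)
logits = [2.0, 1.0, 0.1]
picks = [sample_next_token(logits, temperature=0.1) for _ in range(100)]
print(picks.count(0))  # almost always token 0 at low temperature
```

Raising the temperature flattens the distribution and makes generation more varied; lowering it approaches greedy decoding, which is why sampled outputs are "autocomplete from a distribution" rather than a single deterministic answer.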

HALLUCINATIONS, FACTUALITY, AND MITIGATION

A persistent challenge is hallucination: the model may invent facts or make confident-sounding but false statements, because it is a statistical token predictor, not a fact-checking engine. This fallibility shows up as verbatim recall of heavily represented sources (e.g., Wikipedia) alongside fabricated, plausible-sounding details about obscure people. Mitigations include explicit refusals when uncertain, retrieval-augmented generation (e.g., web search) to refresh the context, tool use to verify information, and source citations. Advanced strategies evaluate factuality through internal checks or external judges and integrate uncertainty signals. Even with RLHF, the system remains probabilistic, so users must validate critical outputs. The model's self-identity and knowledge cutoff are also nuanced issues; system prompts or hardcoded identity text can influence how it presents itself in conversation.
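The retrieval-augmented pattern can be sketched as prompt assembly: fetch passages, paste them into the context, and instruct the model to cite sources or refuse. This is an illustrative sketch; `build_rag_prompt` is a hypothetical helper, and the `retrieve` callable stands in for a real web-search or vector-store call.

```python
def build_rag_prompt(question, retrieve):
    """Sketch of retrieval-augmented generation: fetch passages and paste
    them into the context so the model answers from fresh text instead of
    relying on its parametric memory."""
    passages = retrieve(question)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the sources below; say 'I don't know' if they "
        "do not contain the answer, and cite sources like [1].\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

# Toy retriever returning a canned passage.
fake_retrieve = lambda q: ["Paris is the capital of France."]
prompt = build_rag_prompt("What is the capital of France?", fake_retrieve)
print(prompt)
```

The instruction to refuse when the sources are silent is one concrete form of the "explicit refusals when uncertain" mitigation described above.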

REINFORCEMENT LEARNING: RLHF, REWARD MODELS, LIMITATIONS, AND FUTURE DIRECTIONS

Reinforcement learning in language models often takes the form of RLHF: a reward model learns to score outputs based on human preferences, and the policy is then optimized against those scores. Humans rank or rate multiple candidate responses, and a separate reward model learns to imitate these judgments. This indirection dramatically reduces the human labeling burden and enables optimization in domains where direct evaluation is difficult (summarization, creative writing, etc.). However, RLHF is not reinforcement learning against a ground-truth reward: reward models can be gamed, adversarial inputs can exploit their weaknesses, and improvements can plateau or degrade if optimization is pushed too aggressively. Researchers also observe that RL training can reveal or amplify the model's chain-of-thought patterns, sometimes producing lengthy, transparent reasoning traces that are not always reliable. The frontier includes longer-horizon, multi-task reasoning and more robust evaluation protocols to reduce gaming and align outputs with human intent.
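The policy-update idea can be sketched with REINFORCE on a toy two-response "policy": responses the reward model scores highly get their probability pushed up. This is a deliberately simplified stand-in for the PPO-style methods labs actually use, with hand-picked rewards standing in for a learned reward model.

```python
import math, random

def softmax(xs):
    """Convert logits to a probability distribution."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

random.seed(1)
logits = [0.0, 0.0]      # toy policy over two candidate responses
reward = [1.0, 0.0]      # reward-model scores (stand-in for human prefs)
lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    a = 0 if random.random() < probs[0] else 1   # sample a response
    # REINFORCE: d/d_logit_i of log pi(a) is (1[i == a] - probs[i]);
    # scale the update by the reward of the sampled response.
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * reward[a] * grad
print(softmax(logits)[0])  # close to 1: policy collapses onto the rewarded response
```

Note the failure mode this toy makes visible: the policy optimizes whatever the reward scores, so if the reward model is wrong or gameable, the policy will happily converge on the wrong behavior, which is exactly the gaming risk discussed above.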

Common Questions

The pre-training stage: download large filtered web corpora (e.g., Common Crawl / FineWeb), preprocess and tokenize the text, then train a model to predict next tokens. (see 63s)

Topics

Mentioned in this video

[tool] Byte Pair Encoding (BPE)

Tokenization algorithm (merging common byte/byte-pair sequences) used to build large vocabularies (~100k tokens).

[tool] Transformer

Neural network architecture visualized and explained as the core model type for LLMs.

[tool] NVIDIA H100

High-performance GPU hardware used to train large language models; price and cluster use are discussed.

[tool] Hyperbolic

Hosted demo service used in the video to interact with the LLaMA 3.1 405B base model.

[tool] OpenAssistant

Community effort to reproduce SFT datasets similar to InstructGPT (an example of open-source conversation datasets).

[tool] Falcon 7B

Example open model used to demonstrate hallucination behavior (an older model prone to fabrications).

[tool] Python / code interpreter

Tool used by models to compute reliably (e.g., counting, math) by writing code and running it externally.

[tool] LM Studio

Local application used to run distilled / smaller LLMs locally (demo shown).

[tool] FineWeb

Dataset curated by Hugging Face used as an example pretraining corpus (filtered, ~44 TB).

[tool] tiktoken (tokenizer demo)

Website/tool used to inspect tokenization (demonstrated with a GPT-style tokenizer).

[tool] LLaMA 3.1 (405B)

Meta's released base model (405B parameters), cited as an example of a large, openly released base model.

[study] InstructGPT (2022 paper)

OpenAI paper introducing supervised fine-tuning approaches (human labelers, labeling instructions) for making assistants.

[tool] Web search / Bing / Google

Tool integration pattern: the model emits special tokens to call web search and paste results into the context window.

[tool] Together.ai (playground)

Inference provider used to host open-weights models like DeepSeek R1 for public interaction.

[book] Pride and Prejudice

Example book used to explain the difference between model recollection and providing the exact chapter text in context.

[tool] AlphaGo
[tool] GPT-2
[tool] DeepSeek R1

