Key Moments

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 1: Overview, Tokenization

Stanford Online | 80 min video | Apr 14, 2026
TL;DR

Stanford teaches building LLMs from scratch, emphasizing efficiency and mechanics over current prompting trends, but warns that small-scale insights may not perfectly transfer to frontier models due to emergent behaviors at scale.

Key Insights

1. The third edition of CS 336 aims to bridge the gap left by 'prompt engineering' by teaching students how to build language models from the ground up, fostering a deeper understanding often lost with higher-level abstractions.

2. Industrial-scale LLMs like GPT-4 cost upwards of $100 million to train, and their inner workings are often proprietary, making it necessary for educational courses to focus on building smaller, representative models from scratch.

3. Small-scale LLM experiments may not reflect frontier model behavior, as demonstrated by MLP layers accounting for 44% of FLOPs in small models but 80% in 175B models, and emergent behaviors only appearing above critical scale thresholds.

4. The 'bitter lesson' in AI is reinterpreted as 'algorithms that scale are what matters,' emphasizing that algorithmic efficiency is crucial, especially at large scales where even a 5% improvement can represent millions of dollars.

5. The curriculum covers five core areas: basics (tokenization, architecture, training), systems (kernels, parallelism, inference), scaling laws, data (evaluation, curation, processing), and alignment (weak supervision, RL).

6. Byte Pair Encoding (BPE) is a key tokenization technique, starting with bytes and iteratively merging the most frequent adjacent pairs to create a vocabulary, aiming for data compression and adaptive computation, though full implementations can be slow.

The necessity of building LLMs from the ground up.

Stanford's CS 336 course emphasizes building language models from scratch, a philosophy born from a perceived disconnect between AI researchers and the underlying technology. While prompting models is powerful, abstractions can "leak," leading to limitations. For fundamental research, a deep understanding of the entire stack is crucial. This approach is especially relevant today, as industrial-scale models are prohibitively expensive and their internal details are kept confidential, making hands-on building of smaller, representative models the most effective learning method.

Why small-scale models might not perfectly represent frontier models.

A significant challenge in learning from scratch is that small-scale models may not exhibit the same behaviors as large, frontier models. For instance, the proportion of FLOPs dedicated to MLP layers shifts dramatically: from around 44% in smaller models to 80% in a 175-billion-parameter model. Furthermore, emergent behaviors, such as improved few-shot or zero-shot learning, are often only observed beyond a specific scale threshold. This means that optimizations or insights gained from small-scale experiments might not directly translate to larger models, necessitating a focus on transferable mechanics and mindset over specific empirical results.
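The direction of this shift can be reproduced with a back-of-envelope FLOP count. The sketch below uses a simplified per-layer accounting (the helper name and the exact terms counted are our own assumptions, not the lecture's; it will not reproduce the 44%/80% figures exactly, which depend on the specific models and accounting, but it shows the same trend of the MLP share growing with model width):

```python
def mlp_flop_fraction(d_model: int, seq_len: int, ff_mult: int = 4) -> float:
    """Fraction of per-token, per-layer FLOPs spent in the MLP block,
    under a simplified accounting: 2 FLOPs per multiply-add, counting
    only the big matmuls (no softmax, norms, or embeddings)."""
    mlp = 2 * 2 * ff_mult * d_model * d_model   # up- and down-projection
    attn_proj = 4 * 2 * d_model * d_model       # Q, K, V, output projections
    attn_scores = 2 * 2 * seq_len * d_model     # QK^T and attention @ V, per token
    return mlp / (mlp + attn_proj + attn_scores)

# A small GPT-2-like config vs. a GPT-3-scale config (illustrative dims):
small = mlp_flop_fraction(d_model=768, seq_len=1024)
large = mlp_flop_fraction(d_model=12288, seq_len=2048)
```

Under these assumptions the MLP fraction rises from roughly 0.55 at the small configuration to roughly 0.65 at the large one: the width-dependent MLP terms grow faster than the context-dependent attention-score terms, which is the effect the lecture's numbers illustrate.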

The 'bitter lesson' and the paramount importance of efficiency.

The course reframes the 'bitter lesson' not as scale being the only factor, but as 'algorithms that scale are what matter.' Efficiency, defined as output over input (e.g., accuracy per FLOP), is critical. At large scales, even minor inefficiencies become astronomically expensive. A 5% improvement can translate to millions of dollars saved. The course aims to instill a mindset of prioritizing efficiency, emphasizing profiling, benchmarking, and squeezing maximum performance from hardware. This is supported by historical evidence, such as a 44x algorithmic efficiency gain on ImageNet between 2012 and 2019, which, combined with hardware improvements, led to significant accuracy boosts.

The CS336 curriculum: A five-part journey.

The course is structured around five core assignments, mirroring a progressive learning path: 1. Basics (tokenization, architecture, training), 2. Systems (kernels, parallelism, inference), 3. Scaling Laws, 4. Data (evaluation, curation, processing), and 5. Alignment. This structure ensures students not only understand the theoretical concepts but also gain practical experience in building, optimizing, and scaling language models. The philosophy is 'from scratch,' with assignments providing unit tests for correctness rather than complete scaffolding, encouraging deep engagement with the material.

Tokenization: From raw text to model input.

Tokenization is the process of converting raw text into a sequence of integers that a language model can process. While seemingly simple, it has critical implications for efficiency and model behavior. Traditional methods like character-based or word-based tokenization have significant drawbacks. Character-level tokenization leads to very long sequences and inefficient vocabulary use, while word-based tokenization can result in unbounded vocabularies and unknown tokens ('unk'). The course focuses on Byte Pair Encoding (BPE), a data-driven heuristic that merges frequent byte pairs to create a vocabulary. BPE offers better compression ratios, reduces sequence length (crucial for attention mechanisms), and allows for adaptive computation by representing common phrases as single tokens while breaking down rare ones. Even as research moves towards end-to-end byte-level models, tokenization remains essential due to its ability to abstract and compress information effectively.
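The merge loop described above fits in a few lines of Python. This is a toy trainer, a minimal sketch for illustration rather than the course's optimized implementation (which is why the text notes that naive versions can be slow):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Toy BPE trainer: start from raw bytes, then repeatedly merge the
    most frequent adjacent token pair into a new token id."""
    tokens = list(text.encode("utf-8"))   # ids 0..255 are reserved for raw bytes
    merges = []                           # ordered (pair, new_id) merge rules
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append((best, next_id))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(next_id)    # replace the pair with one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1
    return tokens, merges

tokens, merges = train_bpe("the cat in the hat", num_merges=5)
```

After five merges the 18-byte input compresses to 9 tokens, and the first learned rule merges the frequent pair 't','h'. Applying the same `merges` list in order to new text gives the encoder; this toy version rescans the whole corpus per merge, which is what makes naive implementations slow at scale.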

Systems: Optimizing for hardware and inference.

The 'Systems' unit delves into optimizing language models for hardware, focusing on resource accounting (FLOPs, memory), kernel optimization, parallelism, and inference. Understanding where computation and memory are spent is crucial, with formulas like '6 * N * D' helping to quantify training costs. Hardware bottlenecks, particularly data movement between memory and compute units, are highlighted. Techniques like operator fusion and tiling are discussed to minimize this data movement. For large-scale distributed training, concepts like sharding data, models, and layers across GPUs are explored, alongside classic collective operations like gather and all-reduce. Inference, increasingly important, is examined through its prefill and decode phases, with strategies to speed it up including pruning, quantization, distillation, and speculative decoding. Assignment two involves implementing kernels in Triton and parallel training.
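The '6 * N * D' accounting rule is straightforward to apply directly. A minimal sketch (the helper name is ours):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Back-of-envelope training cost: roughly 6 FLOPs per parameter per
    training token (about 2 for the forward pass, 4 for the backward)."""
    return 6.0 * n_params * n_tokens

# Example: a 70B-parameter model trained on 1.4T tokens.
cost = training_flops(70e9, 1.4e12)  # 5.88e23 FLOPs
```

Dividing such an estimate by a cluster's sustained FLOP/s rate gives a first-cut training-time estimate, which is the kind of resource accounting the unit emphasizes.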

Scaling laws: Predicting performance at scale.

Scaling laws provide a framework for predicting LLM performance based on compute budget, rather than training a single model. This is vital when a single training run can cost tens of millions of dollars. The key idea is to develop a 'scaling recipe'—a mapping from a FLOP budget to hyperparameters—by running smaller experiments, fitting scaling laws, and projecting results to larger scales. Predictability at scale is as important as optimality, achieved through hyperparameter transferability across different model sizes. Classic scaling laws, like those proposed by Kaplan et al. and Chinchilla, balance the number of parameters against the number of training tokens for a given FLOP budget. A common rule of thumb suggests training for roughly 20 tokens per parameter (e.g., 1.4 trillion tokens for a 70B model), though this can vary and doesn't account for inference cost considerations. Assignment three simulates high-stakes training scenarios using a caching mechanism, allowing students to fit scaling laws and extrapolate performance within a budget.
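The 20-tokens-per-parameter rule of thumb can be turned into a tiny budget-allocation helper. A sketch under the stated assumptions (training cost C = 6·N·D and D = 20·N; the function name is ours):

```python
def chinchilla_allocation(flop_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget C between parameters N and training tokens D,
    assuming C = 6 * N * D and D = tokens_per_param * N.
    Solving: N = sqrt(C / (6 * tokens_per_param)), D = tokens_per_param * N."""
    n_params = (flop_budget / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# Feeding in the budget implied by a 70B model on 1.4T tokens
# recovers roughly N = 70e9 and D = 1.4e12, matching the example above.
n, d = chinchilla_allocation(6.0 * 70e9 * 1.4e12)
```

As the text cautions, the 20:1 ratio is only a rule of thumb: it varies across studies and ignores inference cost, which pushes deployed models toward fewer parameters trained on more tokens.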

Data and alignment: Guiding model behavior.

The 'Data' unit emphasizes that data quality dictates model quality and desired capabilities. It covers evaluation metrics (internal, like perplexity, for development; external for reporting), focusing on ecological validity and avoiding data contamination. Data curation involves actively collecting diverse sources like web pages, books, code, and papers, while also navigating legal and ethical considerations such as copyright. Data processing includes transformation, filtering, deduplication, and potentially synthetic data generation, applicable to pre-training, mid-training (high-quality, long-context data), and post-training (conversational data). The 'Alignment' unit addresses improving model behavior beyond next-token prediction using weak supervision. Techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are discussed, where the model is updated to favor chosen responses over rejected ones. Challenges include RL instability and the system complexity of achieving large-scale throughput, with future discussions planned on these aspects.
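The internal perplexity metric mentioned above reduces to a one-line computation: the exponentiated average negative log-likelihood per token. A minimal sketch (the function name is ours):

```python
import math

def perplexity(token_log_probs: list) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token.
    Lower is better; a model assigning every token probability 1/V
    (uniform over a vocabulary of size V) has perplexity exactly V."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Four tokens, each predicted with probability 0.25: perplexity is ~4.0.
ppl = perplexity([math.log(0.25)] * 4)
```

Because perplexity depends on the tokenizer (probabilities are per token, and tokenizations differ), it works well as an internal development metric but is awkward for cross-model reporting, which is one reason the course distinguishes internal from external evaluation.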

Common Questions

What is the core philosophy of the CS336 course?

To build language models from scratch. This hands-on approach ensures students deeply understand the underlying mechanics and develop strong engineering skills, rather than relying solely on high-level abstractions.


Mentioned in this video

Software & Apps
BERT

A language model that was commonly fine-tuned on downstream tasks in the past.

GPT-4

A frontier language model reportedly costing hundreds of millions to a billion dollars to train, with undisclosed training details.

GPT-2

A scaled-up version of GPT, mentioned as an early example of large language model scaling.

GPT-3

A massive language model trained by OpenAI that demonstrated emergent behavior like in-context learning.

Llama

A series of open-weight language models developed by Meta, significantly impacting the open ecosystem.

Qwen

A prominent Chinese language model that is part of the open-weight model ecosystem.

ChatGPT

Represents a shift in language model interaction, moving from fine-tuning/prompting to conversational interfaces.

Python

A programming language mentioned in the context of the executable lecture format and for implementing BPE tokenizers.

cs336.stanford.edu

The official website for the CS336 language modeling course, containing course information.

cs224n

A Stanford NLP class whose assignments are compared to the workload of CS336.

Claude

An AI model whose use to directly complete CS336 assignments is prohibited, so as not to hinder learning.

Gated DeltaNet

A state space model or linear attention model that has become popular in recent years.

GPT-5

A hypothetical future model, whose tokenizer is referenced as a point of comparison when discussing current tokenizer capabilities.

LSTMs

Long Short-Term Memory networks, a recurrent architecture introduced in the 1990s that was central to neural sequence modeling before transformers.

Rust

A programming language suggested for implementing high-performance tokenizers if Python proves too slow.

Adam optimizer

An optimization algorithm used in neural networks, mentioned as part of the development lineage of modern language models.

ELMo

An early language model trained on large text corpora, which could be fine-tuned for downstream tasks.

Mamba

A state space model or linear attention model that has become popular in recent years.

nanoGPT

A project whose speedruns are compared to the leaderboard challenge in Assignment 1.

Muon

An optimizer increasingly used in training the latest open models, such as the Kimi K2 models.

PyTorch

A deep learning framework whose primitives correspond to launching GPU kernels; students use it implicitly.

H-Net

A recent work on end-to-end byte-level operation that shows promise but has not yet been scaled to frontier models.
