Key Moments

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 1: Overview, Tokenization

Stanford Online | 80 min video | Apr 14, 2026
TL;DR

Stanford teaches building LLMs from scratch, emphasizing efficiency and mechanics over current prompting trends, but warns that small-scale insights may not perfectly transfer to frontier models due to emergent behaviors at scale.

Key Insights

1. The third edition of CS 336 aims to bridge the gap left by 'prompt engineering' by teaching students how to build language models from the ground up, fostering a deeper understanding often lost with higher-level abstractions.

2. Industrial-scale LLMs like GPT-4 cost upwards of $100 million to train, and their inner workings are often proprietary, making it necessary for educational courses to focus on building smaller, representative models from scratch.

3. Small-scale LLM experiments may not reflect frontier model behavior, as demonstrated by MLP layers accounting for 44% of FLOPs in small models but 80% in 175B models, and emergent behaviors only appearing above critical scale thresholds.

4. The 'bitter lesson' in AI is reinterpreted as 'algorithms that scale are what matters,' emphasizing that algorithmic efficiency is crucial, especially at large scales where even a 5% improvement can represent millions of dollars.

5. The curriculum covers five core areas: basics (tokenization, architecture, training), systems (kernels, parallelism, inference), scaling laws, data (evaluation, curation, processing), and alignment (weak supervision, RL).

6. Byte Pair Encoding (BPE) is a key tokenization technique, starting with bytes and iteratively merging the most frequent adjacent pairs to create a vocabulary, aiming for data compression and adaptive computation, though full implementations can be slow.

The necessity of building LLMs from the ground up.

Stanford's CS 336 course emphasizes building language models from scratch, a philosophy born from a perceived disconnect between AI researchers and the underlying technology. While prompting models is powerful, abstractions can "leak," leading to limitations. For fundamental research, a deep understanding of the entire stack is crucial. This approach is especially relevant today, as industrial-scale models are prohibitively expensive and their internal details are kept confidential, making hands-on building of smaller, representative models the most effective learning method.

Why small-scale models might not perfectly represent frontier models.

A significant challenge in learning from scratch is that small-scale models may not exhibit the same behaviors as large, frontier models. For instance, the proportion of FLOPs dedicated to MLP layers shifts dramatically: from around 44% in smaller models to 80% in a 175-billion-parameter model. Furthermore, emergent behaviors, such as improved few-shot or zero-shot learning, are often only observed beyond a specific scale threshold. This means that optimizations or insights gained from small-scale experiments might not directly translate to larger models, necessitating a focus on transferable mechanics and mindset over specific empirical results.
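The direction of this shift can be reproduced with a back-of-envelope FLOP count. The sketch below uses a simplified per-layer accounting (the helper name and the exact terms counted are our own assumptions, not the lecture's; it will not reproduce the 44%/80% figures exactly, which depend on the specific models and accounting, but it shows the same trend of the MLP share growing with model width):

```python
def mlp_flop_fraction(d_model: int, seq_len: int, ff_mult: int = 4) -> float:
    """Fraction of per-token, per-layer FLOPs spent in the MLP block,
    under a simplified accounting: 2 FLOPs per multiply-add, counting
    only the big matmuls (no softmax, norms, or embeddings)."""
    mlp = 2 * 2 * ff_mult * d_model * d_model   # up- and down-projection
    attn_proj = 4 * 2 * d_model * d_model       # Q, K, V, output projections
    attn_scores = 2 * 2 * seq_len * d_model     # QK^T and attention @ V, per token
    return mlp / (mlp + attn_proj + attn_scores)

# A small GPT-2-like config vs. a GPT-3-scale config (illustrative dims):
small = mlp_flop_fraction(d_model=768, seq_len=1024)
large = mlp_flop_fraction(d_model=12288, seq_len=2048)
```

Under these assumptions the MLP fraction rises from roughly 0.55 at the small configuration to roughly 0.65 at the large one: the width-dependent MLP terms grow faster than the context-dependent attention-score terms, which is the effect the lecture's numbers illustrate.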

The 'bitter lesson' and the paramount importance of efficiency.

The course reframes the 'bitter lesson' not as scale being the only factor, but as 'algorithms that scale are what matter.' Efficiency, defined as output over input (e.g., accuracy per FLOP), is critical. At large scales, even minor inefficiencies become astronomically expensive. A 5% improvement can translate to millions of dollars saved. The course aims to instill a mindset of prioritizing efficiency, emphasizing profiling, benchmarking, and squeezing maximum performance from hardware. This is supported by historical evidence, such as a 44x algorithmic efficiency gain on ImageNet between 2012 and 2019, which, combined with hardware improvements, led to significant accuracy boosts.

The CS336 curriculum: A five-part journey.

The course is structured around five core assignments, mirroring a progressive learning path: 1. Basics (tokenization, architecture, training), 2. Systems (kernels, parallelism, inference), 3. Scaling Laws, 4. Data (evaluation, curation, processing), and 5. Alignment. This structure ensures students not only understand the theoretical concepts but also gain practical experience in building, optimizing, and scaling language models. The philosophy is 'from scratch,' with assignments providing unit tests for correctness rather than complete scaffolding, encouraging deep engagement with the material.

Tokenization: From raw text to model input.

Tokenization is the process of converting raw text into a sequence of integers that a language model can process. While seemingly simple, it has critical implications for efficiency and model behavior. Traditional methods like character-based or word-based tokenization have significant drawbacks. Character-level tokenization leads to very long sequences and inefficient vocabulary use, while word-based tokenization can result in unbounded vocabularies and unknown tokens ('unk'). The course focuses on Byte Pair Encoding (BPE), a data-driven heuristic that merges frequent byte pairs to create a vocabulary. BPE offers better compression ratios, reduces sequence length (crucial for attention mechanisms), and allows for adaptive computation by representing common phrases as single tokens while breaking down rare ones. Even as research moves towards end-to-end byte-level models, tokenization remains essential due to its ability to abstract and compress information effectively.
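The merge loop described above fits in a few lines of Python. This is a toy trainer, a minimal sketch for illustration rather than the course's optimized implementation (which is why the text notes that naive versions can be slow):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Toy BPE trainer: start from raw bytes, then repeatedly merge the
    most frequent adjacent token pair into a new token id."""
    tokens = list(text.encode("utf-8"))   # ids 0..255 are reserved for raw bytes
    merges = []                           # ordered (pair, new_id) merge rules
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append((best, next_id))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(next_id)    # replace the pair with one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1
    return tokens, merges

tokens, merges = train_bpe("the cat in the hat", num_merges=5)
```

After five merges the 18-byte input compresses to 9 tokens, and the first learned rule merges the frequent pair 't','h'. Applying the same `merges` list in order to new text gives the encoder; this toy version rescans the whole corpus per merge, which is what makes naive implementations slow at scale.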

Systems: Optimizing for hardware and inference.

The 'Systems' unit delves into optimizing language models for hardware, focusing on resource accounting (FLOPs, memory), kernel optimization, parallelism, and inference. Understanding where computation and memory are spent is crucial, with formulas like '6 * N * D' helping to quantify training costs. Hardware bottlenecks, particularly data movement between memory and compute units, are highlighted. Techniques like operator fusion and tiling are discussed to minimize this data movement. For large-scale distributed training, concepts like sharding data, models, and layers across GPUs are explored, alongside classic collective operations like gather and all-reduce. Inference, increasingly important, is examined through its prefill and decode phases, with strategies to speed it up including pruning, quantization, distillation, and speculative decoding. Assignment two involves implementing kernels in Triton and parallel training.
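The '6 * N * D' accounting rule is straightforward to apply directly. A minimal sketch (the helper name is ours):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Back-of-envelope training cost: roughly 6 FLOPs per parameter per
    training token (about 2 for the forward pass, 4 for the backward)."""
    return 6.0 * n_params * n_tokens

# Example: a 70B-parameter model trained on 1.4T tokens.
cost = training_flops(70e9, 1.4e12)  # 5.88e23 FLOPs
```

Dividing such an estimate by a cluster's sustained FLOP/s rate gives a first-cut training-time estimate, which is the kind of resource accounting the unit emphasizes.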

Scaling laws: Predicting performance at scale.

Scaling laws provide a framework for predicting LLM performance based on compute budget, rather than training a single model. This is vital when a single training run can cost tens of millions of dollars. The key idea is to develop a 'scaling recipe'—a mapping from a FLOP budget to hyperparameters—by running smaller experiments, fitting scaling laws, and projecting results to larger scales. Predictability at scale is as important as optimality, achieved through hyperparameter transferability across different model sizes. Classic scaling laws, like those proposed by Kaplan et al. and Chinchilla, balance the number of parameters against the number of training tokens for a given FLOP budget. A common rule of thumb suggests training for roughly 20 tokens per parameter (e.g., 1.4 trillion tokens for a 70B model), though this can vary and doesn't account for inference cost considerations. Assignment three simulates high-stakes training scenarios using a caching mechanism, allowing students to fit scaling laws and extrapolate performance within a budget.
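The 20-tokens-per-parameter rule of thumb can be turned into a tiny budget-allocation helper. A sketch under the stated assumptions (training cost C = 6·N·D and D = 20·N; the function name is ours):

```python
def chinchilla_allocation(flop_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget C between parameters N and training tokens D,
    assuming C = 6 * N * D and D = tokens_per_param * N.
    Solving: N = sqrt(C / (6 * tokens_per_param)), D = tokens_per_param * N."""
    n_params = (flop_budget / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# Feeding in the budget implied by a 70B model on 1.4T tokens
# recovers roughly N = 70e9 and D = 1.4e12, matching the example above.
n, d = chinchilla_allocation(6.0 * 70e9 * 1.4e12)
```

As the text cautions, the 20:1 ratio is only a rule of thumb: it varies across studies and ignores inference cost, which pushes deployed models toward fewer parameters trained on more tokens.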

Data and alignment: Guiding model behavior.

The 'Data' unit emphasizes that data quality dictates model quality and desired capabilities. It covers evaluation metrics (internal, like perplexity, for development; external for reporting), focusing on ecological validity and avoiding data contamination. Data curation involves actively collecting diverse sources like web pages, books, code, and papers, while also navigating legal and ethical considerations such as copyright. Data processing includes transformation, filtering, deduplication, and potentially synthetic data generation, applicable to pre-training, mid-training (high-quality, long-context data), and post-training (conversational data). The 'Alignment' unit addresses improving model behavior beyond next-token prediction using weak supervision. Techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are discussed, where the model is updated to favor chosen responses over rejected ones. Challenges include RL instability and the system complexity of achieving large-scale throughput, with future discussions planned on these aspects.
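The internal perplexity metric mentioned above reduces to a one-line computation: the exponentiated average negative log-likelihood per token. A minimal sketch (the function name is ours):

```python
import math

def perplexity(token_log_probs: list) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token.
    Lower is better; a model assigning every token probability 1/V
    (uniform over a vocabulary of size V) has perplexity exactly V."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Four tokens, each predicted with probability 0.25: perplexity is ~4.0.
ppl = perplexity([math.log(0.25)] * 4)
```

Because perplexity depends on the tokenizer (probabilities are per token, and tokenizations differ), it works well as an internal development metric but is awkward for cross-model reporting, which is one reason the course distinguishes internal from external evaluation.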

Common Questions

What is the core philosophy of the CS336 course?

To build language models from scratch. This hands-on approach ensures students deeply understand the underlying mechanics and develop strong engineering skills, rather than relying solely on high-level abstractions.


Mentioned in this video

Software & Apps
BERT

A language model that was commonly fine-tuned on downstream tasks in the past.

GPT-4

A frontier language model reportedly costing hundreds of millions to a billion dollars to train, with undisclosed training details.

GPT-2

A scaled-up version of GPT, mentioned as an early example of large language model scaling.

GPT-3

A massive language model trained by OpenAI that demonstrated emergent behavior like in-context learning.

Llama

A series of open-weight language models developed by Meta, significantly impacting the open ecosystem.

Qwen

A prominent Chinese language model that is part of the open-weight model ecosystem.

ChatGPT

Represents a shift in language model interaction, moving from fine-tuning/prompting to conversational interfaces.

Python

A programming language mentioned in the context of the executable lecture format and for implementing BPE tokenizers.

cs336.stanford.edu

The official website for the CS336 language modeling course, containing course information.

cs224n

A Stanford NLP class whose assignments are compared to the workload of CS336.

Claude

An AI model whose use to directly complete CS336 assignments is prohibited, so as not to hinder learning.

Gated DeltaNet

A state space model or linear attention model that has become popular in recent years.

GPT-5

A hypothetical future model, whose tokenizer is referenced as a point of comparison when discussing current tokenizer capabilities.

LSTMs

Long Short-Term Memory networks, a recurrent architecture introduced in the 1990s that was central to neural sequence modeling before transformers.

Rust

A programming language suggested for implementing high-performance tokenizers if Python proves too slow.

Adam optimizer

An optimization algorithm used in neural networks, mentioned as part of the development lineage of modern language models.

ELMo

An early language model trained on large text corpora, which could be fine-tuned for downstream tasks.

Mamba

A state space model or linear attention model that has become popular in recent years.

nanoGPT

A project whose speedruns are compared to the leaderboard challenge in Assignment 1.

Muon

An optimizer increasingly used in training the latest open models, such as the Kimi K2 models.

PyTorch

A deep learning framework whose primitives correspond to launching GPU kernels; students use it implicitly.

H-Net

A recent work on end-to-end byte-level operation that shows promise but has not yet been scaled to frontier models.
