Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 6: Kernels, Triton, XLA
Key Moments
GPU programming requires deep hardware understanding, as performance hinges on managing memory hierarchy, warp execution, and bank conflicts, not just algorithmic correctness.
Key Insights
NVIDIA GPUs feature a memory hierarchy: registers (fastest, smallest) to shared memory/L1 cache, L2 cache, and finally HBM (slowest, largest).
Threads are organized into thread blocks, which are scheduled onto Streaming Multiprocessors (SMs), enabling threads within a block to communicate via shared memory – crucial for operations like matrix multiplication.
A warp consists of 32 threads executing instructions in lockstep; control divergence (different threads executing different instructions) leads to serialization and inefficiency.
Bank conflicts occur when multiple threads in a warp access different addresses that map to the same shared-memory bank, forcing the accesses to be serialized and reducing performance.
Benchmarking and profiling are essential iterative processes to identify performance bottlenecks, with CUDA events recommended for accurate GPU timing and PyTorch profiler for kernel insights.
Triton, developed by OpenAI, shifts the programming model to focus on thread blocks, simplifying the process of loading data into shared memory, operating on it, and writing back to global memory.
Understanding the GPU Memory Hierarchy and Execution Model
The lecture begins by detailing the structure of NVIDIA GPUs, emphasizing the memory hierarchy: registers, L1 cache/shared memory, L2 cache, and High Bandwidth Memory (HBM). Registers and L1/shared memory are fast but small and reside on the Streaming Multiprocessors (SMs), while L2 and HBM are larger but slower. Performance is directly tied to how effectively code exploits this hierarchy, keeping data in the faster, local levels. The programming model involves threads, grouped into thread blocks (Cooperative Thread Arrays, or CTAs), which collectively form a grid. While element-wise operations map naturally to independent threads, operations requiring thread communication, such as matrix multiplication or softmax, need thread blocks and shared memory to avoid the latency of repeated HBM accesses. Thread blocks are scheduled onto SMs, allowing threads within a block to cooperate through shared memory.
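As a small illustration of the grid/block decomposition (a minimal sketch, not code from the lecture; the constants and kernel name are placeholders), this is how a 1-D launch grid is typically sized in Triton so that each thread block covers one chunk of the input:

```python
import triton

N = 1_000_000      # total elements to process
BLOCK_SIZE = 1024  # elements handled by one thread block (program instance)

# One grid entry per thread block; cdiv rounds up so the ragged tail block
# is still launched. The hardware then schedules these blocks onto SMs.
grid = (triton.cdiv(N, BLOCK_SIZE),)
# A kernel would be launched as: my_kernel[grid](..., BLOCK_SIZE=BLOCK_SIZE)
```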
The Critical Role of Warps and Avoiding Control Divergence
Threads within a thread block are further grouped into units called warps, typically containing 32 threads on NVIDIA hardware. All threads within a warp execute the same instruction simultaneously (in lockstep). This design is efficient, but it breaks down when threads within a warp need to execute different instructions – a situation known as control divergence, often caused by conditional branching (if-else statements). When divergence occurs, the SM must serialize the execution paths, drastically reducing performance. Therefore, avoiding control flow that leads to divergence is paramount for efficient GPU programming. The SM's warp scheduler can switch between active warps with zero cost, effectively hiding memory latency by executing another warp while one waits for data from HBM.
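To make divergence concrete, here is an illustrative Triton kernel (my sketch, not the lecture's code): a data-dependent choice is expressed with an element-wise `tl.where` select instead of an if/else branch, so every thread in a warp executes the same instruction stream and nothing serializes.

```python
import triton
import triton.language as tl

@triton.jit
def leaky_relu_kernel(x_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # Branch-free select: both sides are computed and tl.where picks per
    # element, so all 32 threads of a warp stay in lockstep rather than
    # serializing down two if/else paths.
    y = tl.where(x > 0, x, 0.05 * x)
    tl.store(out_ptr + offs, y, mask=mask)
```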
Optimizing Shared Memory Access: Bank Conflicts and Coalescing
Shared memory, being on-chip, is much faster than HBM and crucial for performance. However, it is divided into 32 banks, and each bank can service only one address per clock cycle. If threads in a warp access different addresses that map to the same bank, a bank conflict occurs, serializing those accesses and negating the parallelism shared memory offers (accesses to the same address are instead broadcast and incur no conflict). This is particularly problematic when accessing columns of a matrix that is laid out row-wise in shared memory. Memory coalescing plays the analogous role for HBM: when threads in a warp access contiguous memory locations, their requests are combined into a single transaction, and full coalescing, where all threads in a warp hit the same cache line, is the most efficient scenario. Understanding and optimizing for bank conflicts and memory coalescing is vital for high-performance kernels.
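A hedged sketch (mine, not the lecture's) of why the access pattern matters: copying a row of a row-major matrix gives a warp contiguous addresses that coalesce into a few wide transactions, while copying a column strides by the row length, scattering the warp's loads across many cache lines.

```python
import triton
import triton.language as tl

@triton.jit
def copy_row_and_col_kernel(x_ptr, row_out_ptr, col_out_ptr,
                            n_rows, n_cols, idx,
                            BLOCK_SIZE: tl.constexpr):
    offs = tl.arange(0, BLOCK_SIZE)
    # Coalesced: consecutive lanes read consecutive addresses in row `idx`.
    row_mask = offs < n_cols
    row = tl.load(x_ptr + idx * n_cols + offs, mask=row_mask)
    tl.store(row_out_ptr + offs, row, mask=row_mask)
    # Uncoalesced: lanes read addresses n_cols elements apart in column
    # `idx`, touching a different cache line per element.
    col_mask = offs < n_rows
    col = tl.load(x_ptr + offs * n_cols + idx, mask=col_mask)
    tl.store(col_out_ptr + offs, col, mask=col_mask)
```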
Benchmarking, Profiling, and the Philosophy of Performance Measurement
The lecture emphasizes a disciplined approach to performance optimization: benchmark and profile your code, make changes, and then benchmark and profile again. Benchmarking provides end-to-end timings, helping to understand overall performance and scaling. Profiling, on the other hand, reveals where time is actually spent within the code, identifying specific kernels or operations that are bottlenecks. For GPU benchmarking, CUDA events are recommended for accurate timing due to the asynchronous nature of GPU operations, requiring synchronization before recording durations. PyTorch's profiler can detail which CUDA kernels are invoked and how tensor dimensions influence kernel selection (e.g., different matmul kernels for different sizes). The key takeaway is to measure before optimizing, as assumptions about performance can be misleading.
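A minimal benchmarking and profiling harness along the lines described above (a sketch; the helper names and iteration counts are my own choices, but `torch.cuda.Event` and `torch.profiler` are the standard APIs):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def benchmark(fn, *args, warmup=3, iters=10):
    """Time a GPU function with CUDA events, as recommended above."""
    for _ in range(warmup):      # warm-up runs absorb compilation/caching
        fn(*args)
    torch.cuda.synchronize()     # GPU work is async; drain the queue first
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()     # wait before reading the elapsed time
    return start.elapsed_time(end) / iters  # milliseconds per call

def profile_once(fn, *args):
    """Show which CUDA kernels a call actually dispatches."""
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        fn(*args)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```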
Introducing Triton: A Higher-Level Kernel Programming Language
CUDA provides fine-grained control but can be complex, especially when managing thread synchronization and shared memory for operations beyond element-wise tasks. Triton, developed by OpenAI, offers a more streamlined approach by focusing on the thread block as the primary programming unit. In Triton, developers specify what a thread block does: load data into shared memory, perform computations, and write results back to global memory. This abstraction simplifies kernel development, particularly for common deep learning operations. The conceptual model in Triton involves breaking down large computations into manageable blocks that can be efficiently processed using the SM's shared memory, bridging the gap between low-level hardware details and high-level PyTorch operations.
Writing Triton Kernels: From Element-wise to Reductions
The lecture demonstrates writing Triton kernels, starting with a simple element-wise operation (analogous to vector addition). The core structure: each program instance (one per block) determines the range of data it owns using offsets and masks, loads that data from HBM via pointer arithmetic, performs the computation, and stores the results back to HBM. For operations like softmax, which require reductions (e.g., summing across a row), the strategy adapts. If a row fits within a block, the kernel can load the entire row into shared memory and perform the softmax computation there. When a row is too large for a single block, the kernel iterates over tiles within the row, accumulating partial results in registers or shared memory before a final reduction step. This iterative processing within a block becomes necessary whenever the data exceeds shared memory capacity. Both patterns are sketched below.
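The two kernel shapes described above, sketched in Triton (block sizes, names, and launch details are illustrative assumptions, not the lecture's exact code):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                    # which block am I?
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)        # HBM -> on-chip
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)  # write back to HBM

@triton.jit
def softmax_kernel(x_ptr, out_ptr, n_cols, row_stride, BLOCK_SIZE: tl.constexpr):
    # One program instance per row; assumes the whole row fits in one block
    # (BLOCK_SIZE is the next power of two >= n_cols).
    row = tl.program_id(axis=0)
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * row_stride + offs, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)   # subtract the row max for numerical stability
    num = tl.exp(x)
    tl.store(out_ptr + row * row_stride + offs, num / tl.sum(num, axis=0), mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```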
Matrix Multiplication Kernel: Tiling for Shared Memory Efficiency
Matrix multiplication (matmul), a cornerstone of deep learning, is presented as a complex example requiring careful optimization. A naive implementation that reads from HBM for every element computation has poor arithmetic intensity. The idealized approach of loading entire matrices into shared memory is usually infeasible due to size constraints. The practical solution is tiling: breaking the matrices into smaller tiles that fit in shared memory. A thread block is assigned to compute one tile of the output matrix. Inside the block, threads load the corresponding tiles of the input matrices A and B into shared memory, multiply them, and accumulate the result in a partial sum (typically held in registers). This repeats across all relevant tiles of A and B, reproducing the idealized approach locally while handling arbitrarily large matrices globally. Kernel fusion, such as applying a ReLU activation directly after the matmul within the same kernel, is also shown as an efficient practice.
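A sketch of the tiling loop (the standard Triton matmul structure, with an illustrative fused ReLU; the block sizes and names are assumptions, not the lecture's exact kernel):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_relu_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C = relu(A @ B).
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Load one tile of A and B (staged through shared memory by Triton).
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)             # multiply tiles, accumulate partial sum
        a_ptrs += BLOCK_K * stride_ak   # slide to the next K tile
        b_ptrs += BLOCK_K * stride_bk
    acc = tl.maximum(acc, 0.0)          # fused ReLU: no extra HBM round trip
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(rm[:, None] < M) & (rn[None, :] < N))

def matmul_relu(a, b, BLOCK=64):
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, BLOCK), triton.cdiv(N, BLOCK))
    matmul_relu_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=BLOCK, BLOCK_N=BLOCK, BLOCK_K=32,
    )
    return c
```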
From Triton to PTX: The Compilation Process and Hardware Control
Triton code is compiled into PTX (Parallel Thread Execution), an intermediate assembly language for NVIDIA GPUs. PTX code represents the low-level instructions executed by threads. While Triton abstracts away explicit thread synchronization and simplifies memory management, PTX reveals how threads operate, including data loading (ld.global), computations, and storing results (st.global). The compiler can also perform optimizations like thread coarsening, where a single thread processes multiple data elements to improve utilization. Crucially, PTX does not expose all hardware details; aspects like warp scheduling, SM assignment, and exact memory access patterns are managed by the GPU's hardware and driver, remaining somewhat opaque to the programmer. This highlights the interplay between high-level programming models like Triton, intermediate representations like PTX, and the underlying hardware architecture for achieving optimal performance.
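One way to look at the generated PTX yourself (a sketch; in recent Triton releases the handle returned by a kernel launch exposes compiled artifacts via an `asm` dictionary, but the exact attribute names vary by version and should be treated as an assumption):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
# Launching returns a compiled-kernel handle; its `asm` dict holds the
# intermediate artifacts (attribute layout is version-dependent).
handle = add_kernel[(triton.cdiv(x.numel(), 1024),)](x, y, out, x.numel(), BLOCK_SIZE=1024)
print(handle.asm["ptx"])  # ld.global / st.global appear in the PTX body
```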
GPU Kernel Programming Best Practices
Practical takeaways from this episode
Do This
- Benchmark and profile before and after every change: CUDA events for accurate GPU timing, the PyTorch profiler for kernel-level insight.
- Stage reused data in shared memory and tile large operations (such as matmul) so working sets fit on-chip.
- Arrange loads so that threads in a warp touch contiguous addresses, enabling memory coalescing.
Avoid This
- Conditional branching that makes threads within a warp diverge; divergent paths execute serially.
- Shared-memory access patterns in which threads of a warp hit different addresses in the same bank (bank conflicts).
- Optimizing based on assumptions instead of measurements.
GPU Memory Hierarchy Speed Comparison
Data extracted from this episode
| Memory Type | Speed | Size |
|---|---|---|
| Registers (per thread) | Fastest | Smallest |
| L1 Cache / Shared Memory (per SM) | Very fast | Small |
| L2 Cache (per chip) | Slower | Medium |
| High-Bandwidth Memory (HBM) | Slowest | Largest |
Common Questions
How do threads, thread blocks, and grids relate?
Threads are the basic execution units, each processing a part of the data. Threads are grouped into thread blocks (or CTAs), which run concurrently on a Streaming Multiprocessor (SM). A grid is the collection of thread blocks launched to perform a kernel computation.
Topics
Mentioned in this video
- NVIDIA GPUs: the lecture focuses on the architecture and programming model of NVIDIA GPUs across generations (e.g., A100, H100, B200), including register counts, SM counts and features, and tensor memory.
- TPUs: mentioned as another type of accelerator alongside NVIDIA and AMD GPUs.
- HBM: High-Bandwidth Memory, a key component of GPU architecture, contrasted with faster local memory like registers and L1 cache.
- An alternative library for GPU programming, mentioned as differing in characteristics from Triton.
- PyTorch: a deep learning framework used to demonstrate benchmarking and profiling, and the foundation from which the Triton kernels are invoked.
- cuBLAS: NVIDIA's CUDA library for linear algebra, mentioned in the context of profiling PyTorch's matrix multiplication kernels.
- CUDA: NVIDIA's parallel computing platform and programming model, originally developed for writing kernels; Triton is presented as an alternative with higher-level abstractions.
- A potential application that the lecture's concepts will enable students to implement.
- torch.compile: a PyTorch function used to compile naive implementations into more efficient kernels, often generating Triton code.
- PTX: the intermediate assembly language for NVIDIA GPUs, generated by compilers like Triton's and directly executed by threads.