Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 6: Kernels, Triton, XLA
Key Moments
GPU programming requires deep hardware understanding, as performance hinges on managing memory hierarchy, warp execution, and bank conflicts, not just algorithmic correctness.
Key Insights
NVIDIA GPUs feature a memory hierarchy: registers (fastest, smallest) to shared memory/L1 cache, L2 cache, and finally HBM (slowest, largest).
Threads are organized into thread blocks, which are scheduled onto Streaming Multiprocessors (SMs), enabling threads within a block to communicate via shared memory – crucial for operations like matrix multiplication.
A warp consists of 32 threads executing instructions in lockstep; control divergence (different threads executing different instructions) leads to serialization and inefficiency.
Bank conflicts occur when multiple threads in a warp access different addresses that map to the same shared-memory bank, forcing the accesses to be serialized and reducing performance.
Benchmarking and profiling are essential iterative processes to identify performance bottlenecks, with CUDA events recommended for accurate GPU timing and PyTorch profiler for kernel insights.
Triton, developed by OpenAI, shifts the programming model to focus on thread blocks, simplifying the process of loading data into shared memory, operating on it, and writing back to global memory.
Understanding the GPU Memory Hierarchy and Execution Model
The lecture begins by detailing the structure of NVIDIA GPUs, emphasizing the memory hierarchy: registers, L1 cache/shared memory, L2 cache, and High Bandwidth Memory (HBM). Registers and L1/shared memory are fast but small and reside on the Streaming Multiprocessors (SMs), while L2 and HBM are larger but slower. Performance is directly tied to how effectively code exploits this hierarchy, keeping data in the faster, local levels. The programming model involves threads, grouped into thread blocks (Cooperative Thread Arrays, or CTAs), which collectively form a grid. While element-wise operations map naturally to independent threads, operations requiring thread communication, such as matrix multiplication or softmax, need thread blocks and shared memory to avoid the latency of repeated HBM accesses. Thread blocks are scheduled onto SMs, allowing threads within a block to cooperate through shared memory.
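As a small illustration of the grid/block decomposition (a minimal sketch, not code from the lecture; the constants and kernel name are placeholders), this is how a 1-D launch grid is typically sized in Triton so that each thread block covers one chunk of the input:

```python
import triton

N = 1_000_000      # total elements to process
BLOCK_SIZE = 1024  # elements handled by one thread block (program instance)

# One grid entry per thread block; cdiv rounds up so the ragged tail block
# is still launched. The hardware then schedules these blocks onto SMs.
grid = (triton.cdiv(N, BLOCK_SIZE),)
# A kernel would be launched as: my_kernel[grid](..., BLOCK_SIZE=BLOCK_SIZE)
```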
The Critical Role of Warps and Avoiding Control Divergence
Threads within a thread block are further grouped into units called warps, typically containing 32 threads on NVIDIA hardware. All threads within a warp execute the same instruction simultaneously (in lockstep). This design is efficient, but it breaks down when threads within a warp need to execute different instructions – a situation known as control divergence, often caused by conditional branching (if-else statements). When divergence occurs, the SM must serialize the execution paths, drastically reducing performance. Therefore, avoiding control flow that leads to divergence is paramount for efficient GPU programming. The SM's warp scheduler can switch between active warps with zero cost, effectively hiding memory latency by executing another warp while one waits for data from HBM.
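To make divergence concrete, here is an illustrative Triton kernel (my sketch, not the lecture's code): a data-dependent choice is expressed with an element-wise `tl.where` select instead of an if/else branch, so every thread in a warp executes the same instruction stream and nothing serializes.

```python
import triton
import triton.language as tl

@triton.jit
def leaky_relu_kernel(x_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # Branch-free select: both sides are computed and tl.where picks per
    # element, so all 32 threads of a warp stay in lockstep rather than
    # serializing down two if/else paths.
    y = tl.where(x > 0, x, 0.05 * x)
    tl.store(out_ptr + offs, y, mask=mask)
```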
Optimizing Shared Memory Access: Bank Conflicts and Coalescing
Shared memory, being on-chip, is much faster than HBM and crucial for performance. However, it is divided into 32 banks, and each bank can service only one address per clock cycle. If threads in a warp access different addresses that map to the same bank, a bank conflict occurs, serializing those accesses and negating the parallelism shared memory offers (accesses to the same address are instead broadcast and incur no conflict). This is particularly problematic when accessing columns of a matrix that is laid out row-wise in shared memory. Memory coalescing plays the analogous role for HBM: when threads in a warp access contiguous memory locations, their requests are combined into a single transaction, and full coalescing, where all threads in a warp hit the same cache line, is the most efficient scenario. Understanding and optimizing for bank conflicts and memory coalescing is vital for high-performance kernels.
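A hedged sketch (mine, not the lecture's) of why the access pattern matters: copying a row of a row-major matrix gives a warp contiguous addresses that coalesce into a few wide transactions, while copying a column strides by the row length, scattering the warp's loads across many cache lines.

```python
import triton
import triton.language as tl

@triton.jit
def copy_row_and_col_kernel(x_ptr, row_out_ptr, col_out_ptr,
                            n_rows, n_cols, idx,
                            BLOCK_SIZE: tl.constexpr):
    offs = tl.arange(0, BLOCK_SIZE)
    # Coalesced: consecutive lanes read consecutive addresses in row `idx`.
    row_mask = offs < n_cols
    row = tl.load(x_ptr + idx * n_cols + offs, mask=row_mask)
    tl.store(row_out_ptr + offs, row, mask=row_mask)
    # Uncoalesced: lanes read addresses n_cols elements apart in column
    # `idx`, touching a different cache line per element.
    col_mask = offs < n_rows
    col = tl.load(x_ptr + offs * n_cols + idx, mask=col_mask)
    tl.store(col_out_ptr + offs, col, mask=col_mask)
```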
Benchmarking, Profiling, and the Philosophy of Performance Measurement
The lecture emphasizes a disciplined approach to performance optimization: benchmark and profile your code, make changes, and then benchmark and profile again. Benchmarking provides end-to-end timings, helping to understand overall performance and scaling. Profiling, on the other hand, reveals where time is actually spent within the code, identifying specific kernels or operations that are bottlenecks. For GPU benchmarking, CUDA events are recommended for accurate timing due to the asynchronous nature of GPU operations, requiring synchronization before recording durations. PyTorch's profiler can detail which CUDA kernels are invoked and how tensor dimensions influence kernel selection (e.g., different matmul kernels for different sizes). The key takeaway is to measure before optimizing, as assumptions about performance can be misleading.
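A minimal benchmarking and profiling harness along the lines described above (a sketch; the helper names and iteration counts are my own choices, but `torch.cuda.Event` and `torch.profiler` are the standard APIs):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def benchmark(fn, *args, warmup=3, iters=10):
    """Time a GPU function with CUDA events, as recommended above."""
    for _ in range(warmup):      # warm-up runs absorb compilation/caching
        fn(*args)
    torch.cuda.synchronize()     # GPU work is async; drain the queue first
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()     # wait before reading the elapsed time
    return start.elapsed_time(end) / iters  # milliseconds per call

def profile_once(fn, *args):
    """Show which CUDA kernels a call actually dispatches."""
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        fn(*args)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```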
Introducing Triton: A Higher-Level Kernel Programming Language
CUDA provides fine-grained control but can be complex, especially when managing thread synchronization and shared memory for operations beyond element-wise tasks. Triton, developed by OpenAI, offers a more streamlined approach by focusing on the thread block as the primary programming unit. In Triton, developers specify what a thread block does: load data into shared memory, perform computations, and write results back to global memory. This abstraction simplifies kernel development, particularly for common deep learning operations. The conceptual model in Triton involves breaking down large computations into manageable blocks that can be efficiently processed using the SM's shared memory, bridging the gap between low-level hardware details and high-level PyTorch operations.
Writing Triton Kernels: From Element-wise to Reductions
The lecture demonstrates writing Triton kernels, starting with a simple element-wise operation (analogous to vector addition). The core structure: each program instance (one per block) determines the range of data it owns using offsets and masks, loads that data from HBM via pointer arithmetic, performs the computation, and stores the results back to HBM. For operations like softmax, which require reductions (e.g., summing across a row), the strategy adapts. If a row fits within a block, the kernel can load the entire row into shared memory and perform the softmax computation there. When a row is too large for a single block, the kernel iterates over tiles within the row, accumulating partial results in registers or shared memory before a final reduction step. This iterative processing within a block becomes necessary whenever the data exceeds shared memory capacity. Both patterns are sketched below.
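The two kernel shapes described above, sketched in Triton (block sizes, names, and launch details are illustrative assumptions, not the lecture's exact code):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                    # which block am I?
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)        # HBM -> on-chip
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)  # write back to HBM

@triton.jit
def softmax_kernel(x_ptr, out_ptr, n_cols, row_stride, BLOCK_SIZE: tl.constexpr):
    # One program instance per row; assumes the whole row fits in one block
    # (BLOCK_SIZE is the next power of two >= n_cols).
    row = tl.program_id(axis=0)
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * row_stride + offs, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)   # subtract the row max for numerical stability
    num = tl.exp(x)
    tl.store(out_ptr + row * row_stride + offs, num / tl.sum(num, axis=0), mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```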
Matrix Multiplication Kernel: Tiling for Shared Memory Efficiency
Matrix multiplication (matmul), a cornerstone of deep learning, is presented as a complex example requiring careful optimization. A naive implementation that reads from HBM for every element computation has poor arithmetic intensity. The idealized approach of loading entire matrices into shared memory is usually infeasible due to size constraints. The practical solution is tiling: breaking the matrices into smaller tiles that fit in shared memory. A thread block is assigned to compute one tile of the output matrix. Inside the block, threads load the corresponding tiles of the input matrices A and B into shared memory, multiply them, and accumulate the result in a partial sum (typically held in registers). This repeats across all relevant tiles of A and B, reproducing the idealized approach locally while handling arbitrarily large matrices globally. Kernel fusion, such as applying a ReLU activation directly after the matmul within the same kernel, is also shown as an efficient practice.
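A sketch of the tiling loop (the standard Triton matmul structure, with an illustrative fused ReLU; the block sizes and names are assumptions, not the lecture's exact kernel):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_relu_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C = relu(A @ B).
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Load one tile of A and B (staged through shared memory by Triton).
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)             # multiply tiles, accumulate partial sum
        a_ptrs += BLOCK_K * stride_ak   # slide to the next K tile
        b_ptrs += BLOCK_K * stride_bk
    acc = tl.maximum(acc, 0.0)          # fused ReLU: no extra HBM round trip
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(rm[:, None] < M) & (rn[None, :] < N))

def matmul_relu(a, b, BLOCK=64):
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, BLOCK), triton.cdiv(N, BLOCK))
    matmul_relu_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=BLOCK, BLOCK_N=BLOCK, BLOCK_K=32,
    )
    return c
```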
From Triton to PTX: The Compilation Process and Hardware Control
Triton code is compiled into PTX (Parallel Thread Execution), an intermediate assembly language for NVIDIA GPUs. PTX code represents the low-level instructions executed by threads. While Triton abstracts away explicit thread synchronization and simplifies memory management, PTX reveals how threads operate, including data loading (ld.global), computations, and storing results (st.global). The compiler can also perform optimizations like thread coarsening, where a single thread processes multiple data elements to improve utilization. Crucially, PTX does not expose all hardware details; aspects like warp scheduling, SM assignment, and exact memory access patterns are managed by the GPU's hardware and driver, remaining somewhat opaque to the programmer. This highlights the interplay between high-level programming models like Triton, intermediate representations like PTX, and the underlying hardware architecture for achieving optimal performance.
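One way to look at the generated PTX yourself (a sketch; in recent Triton releases the handle returned by a kernel launch exposes compiled artifacts via an `asm` dictionary, but the exact attribute names vary by version and should be treated as an assumption):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
# Launching returns a compiled-kernel handle; its `asm` dict holds the
# intermediate artifacts (attribute layout is version-dependent).
handle = add_kernel[(triton.cdiv(x.numel(), 1024),)](x, y, out, x.numel(), BLOCK_SIZE=1024)
print(handle.asm["ptx"])  # ld.global / st.global appear in the PTX body
```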
GPU Kernel Programming Best Practices
Practical takeaways from this episode
Do This
- Benchmark and profile before and after every change: CUDA events for accurate GPU timing, the PyTorch profiler for kernel-level insight.
- Stage reused data in shared memory and tile large operations (such as matmul) so working sets fit on-chip.
- Arrange loads so that threads in a warp touch contiguous addresses, enabling memory coalescing.
Avoid This
- Conditional branching that makes threads within a warp diverge; divergent paths execute serially.
- Shared-memory access patterns in which threads of a warp hit different addresses in the same bank (bank conflicts).
- Optimizing based on assumptions instead of measurements.
GPU Memory Hierarchy Speed Comparison
Data extracted from this episode
| Memory Type | Speed | Size |
|---|---|---|
| Registers (per thread) | Fastest | Smallest |
| L1 Cache / Shared Memory (per SM) | Very fast | Small |
| L2 Cache (per chip) | Slower | Medium |
| High-Bandwidth Memory (HBM) | Slowest | Largest |
Common Questions
How do threads, thread blocks, and grids relate?
Threads are the basic execution units, each processing a part of the data. Threads are grouped into thread blocks (or CTAs), which run concurrently on a Streaming Multiprocessor (SM). A grid is the collection of thread blocks launched to perform a kernel computation.
Topics
Mentioned in this video
- NVIDIA GPUs: the lecture focuses on the architecture and programming model of NVIDIA GPUs across generations (e.g., A100, H100, B200), including register counts, SM counts and features, and tensor memory.
- TPUs: mentioned as another type of accelerator alongside NVIDIA and AMD GPUs.
- HBM: High-Bandwidth Memory, a key component of GPU architecture, contrasted with faster local memory like registers and L1 cache.
- An alternative library for GPU programming, mentioned as differing in characteristics from Triton.
- PyTorch: a deep learning framework used to demonstrate benchmarking and profiling, and the foundation from which the Triton kernels are invoked.
- cuBLAS: NVIDIA's CUDA library for linear algebra, mentioned in the context of profiling PyTorch's matrix multiplication kernels.
- CUDA: NVIDIA's parallel computing platform and programming model, originally developed for writing kernels; Triton is presented as an alternative with higher-level abstractions.
- A potential application that the lecture's concepts will enable students to implement.
- torch.compile: a PyTorch function used to compile naive implementations into more efficient kernels, often generating Triton code.
- PTX: the intermediate assembly language for NVIDIA GPUs, generated by compilers like Triton's and directly executed by threads.