Grouped Query Attention
Concept
A variant of multi-head attention, used in OpenAI's gpt-oss models, in which groups of query heads share a single key/value head. Sharing shrinks the KV cache, which reduces memory use and speeds up inference.
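To make the head-sharing idea concrete, here is a minimal pure-Python sketch of grouped query attention for a single token position. The function name, argument layout, and the rule that query head h reads KV head h // group_size are illustrative assumptions, not taken from gpt-oss or any particular library; masking, batching, and projections are left out.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grouped_query_attention(queries, keys, values):
    """Hypothetical helper, for illustration only.

    queries: [n_q_heads][head_dim], one query vector per head
    keys:    [n_kv_heads][seq_len][head_dim]
    values:  [n_kv_heads][seq_len][head_dim]
    returns: [n_q_heads][head_dim]
    """
    n_q_heads, n_kv_heads = len(queries), len(keys)
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    head_dim = len(queries[0])
    scale = 1.0 / math.sqrt(head_dim)

    outputs = []
    for h, q in enumerate(queries):
        kv = h // group_size  # assumed mapping: consecutive query heads share one KV head
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys[kv]]
        weights = softmax(scores)
        out = [sum(w * v[d] for w, v in zip(weights, values[kv]))
               for d in range(head_dim)]
        outputs.append(out)
    return outputs
```

With n_kv_heads equal to n_q_heads this reduces to ordinary multi-head attention, and with n_kv_heads equal to 1 it becomes multi-query attention; grouped query attention sits between the two.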
Mentioned in 2 videos
Videos Mentioning Grouped Query Attention

OpenAI vs. Deepseek vs. Qwen: Comparing Open Source LLM Architectures
Y Combinator
A variant of multi-head attention, used in OpenAI's gpt-oss models, in which groups of query heads share a single key/value head. Sharing shrinks the KV cache, which reduces memory use and speeds up inference.

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 10: Inference
Stanford Online
An attention mechanism that uses fewer key/value heads than query heads, which shrinks the KV cache and improves inference latency and throughput.
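The effect on KV cache size can be sketched with simple arithmetic. The helper below and the model dimensions passed to it (32 layers, 32 query heads, 8 KV heads, head dimension 128, 16-bit elements) are hypothetical values chosen for illustration, not figures from either video.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # One key vector and one value vector are cached per layer,
    # per KV head, per token, per sequence in the batch.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical configuration with a 4096-token context.
mha_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA cache: {mha_bytes / 2**30:.2f} GiB")
print(f"GQA cache: {gqa_bytes / 2**30:.2f} GiB")
print(f"Reduction: {mha_bytes // gqa_bytes}x")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads, so going from 32 to 8 KV heads cuts it by a factor of 4. The number of query heads does not appear in the formula at all, because queries are never cached.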