Grouped Query Attention
Concept
A variant of multi-head attention, used in GPT OSS, in which groups of query heads share a single key-value head. Fewer key-value heads mean a smaller key-value cache, which reduces memory use and speeds up inference.
Mentioned in 1 video
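The sharing of key-value heads across query heads can be illustrated with a minimal pure-Python sketch. The function names (`attend`, `grouped_query_attention`, `kv_cache_entries`) and the causal masking are illustrative assumptions, not taken from GPT OSS itself, and real implementations operate on batched tensors rather than nested lists.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v):
    """Causal scaled dot-product attention for one head.

    q, k, v are lists of seq_len vectors, each a list of d floats.
    """
    d = len(q[0])
    out = []
    for t, q_t in enumerate(q):
        # Position t may only attend to positions 0..t (causal mask).
        scores = [
            sum(a * b for a, b in zip(q_t, k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[s][j] for s, w in enumerate(weights)) for j in range(d)]
        )
    return out


def grouped_query_attention(q_heads, k_heads, v_heads):
    """Hypothetical helper: query head i reads key-value head i // group_size."""
    n_q, n_kv = len(q_heads), len(k_heads)
    if len(v_heads) != n_kv or n_q % n_kv != 0:
        raise ValueError("query heads must divide evenly among key-value heads")
    group_size = n_q // n_kv
    return [
        attend(q, k_heads[i // group_size], v_heads[i // group_size])
        for i, q in enumerate(q_heads)
    ]


def kv_cache_entries(n_kv_heads, seq_len, head_dim):
    """Number of cached floats per layer: keys plus values for every KV head."""
    return 2 * n_kv_heads * seq_len * head_dim
```

With 8 query heads and 2 key-value heads (numbers chosen only for illustration), each key-value head serves a group of 4 query heads, and `kv_cache_entries` shows the cache is a quarter of the size it would be with 8 key-value heads. Setting the number of key-value heads equal to the number of query heads recovers ordinary multi-head attention; setting it to 1 gives multi-query attention.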