Grouped Query Attention

Concept

A variant of multi-head attention used in GPT OSS in which the query heads are split into groups and each group shares a single key-value head, shrinking the key-value cache and speeding up inference.
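
The sharing of key-value heads across query heads can be sketched in a few lines of NumPy. This is a minimal illustration, not the GPT OSS implementation: the function name `grouped_query_attention`, the head counts (8 query heads, 2 key-value heads), the tensor sizes, and the causal mask are all assumptions chosen for the example.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal attention where several query heads share one key-value head.

    q: (n_q_heads, seq_len, d); k and v: (n_kv_heads, seq_len, d).
    """
    n_q_heads, seq_len, d = q.shape
    n_kv_heads = k.shape[0]
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads

    # Repeat each key-value head so that `group_size` consecutive query
    # heads read the same keys and values.
    k_shared = np.repeat(k, group_size, axis=0)  # (n_q_heads, seq_len, d)
    v_shared = np.repeat(v, group_size, axis=0)

    # Scaled dot-product scores, one (seq_len, seq_len) matrix per query head.
    scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(d)

    # Causal mask (an assumption here): position i may not attend to j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key axis.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_shared  # (n_q_heads, seq_len, d)

# Illustrative sizes: 8 query heads share 2 key-value heads (groups of 4).
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))
k = rng.standard_normal((2, 4, 16))
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 4, 16)
```

Only `k` and `v` need to be cached during generation, so with these assumed sizes the cache holds 2 heads instead of 8, a fourfold reduction. Setting the number of key-value heads equal to the number of query heads recovers standard multi-head attention, and setting it to 1 gives multi-query attention.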

Mentioned in 2 videos