Multi-head attention

Concept

A component of Transformer models in which attention is computed in parallel across several heads, each with its own query, key, and value projections. Mentioned in contrast to more efficient attention schemes like grouped-query attention (GQA) and multi-query attention (MQA), which share key/value heads across query heads to reduce KV cache size.
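Below is a minimal sketch of standard multi-head attention in PyTorch, assuming an embedding dimension `d_model` split evenly across `n_heads`; the class and parameter names are illustrative, not from the video. Because every head keeps its own key/value projection, the KV cache scales with the full head count, which is the cost GQA and MQA reduce by sharing K/V heads.

```python
# Minimal multi-head attention sketch (illustrative, not from the source).
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must divide evenly across heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Separate Q, K, V projections; every head has its own K/V slice,
        # so the KV cache grows with n_heads (unlike GQA/MQA).
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        # Project, then split into heads: (batch, n_heads, seq_len, d_head)
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        q = split(self.q_proj(x))
        k = split(self.k_proj(x))
        v = split(self.v_proj(x))

        # Scaled dot-product attention, computed independently per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = scores.softmax(dim=-1)
        out = weights @ v

        # Merge heads back into (batch, seq_len, d_model)
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.out_proj(out)

# Usage: 8 heads over a 512-dim model
mha = MultiHeadAttention(d_model=512, n_heads=8)
x = torch.randn(2, 16, 512)
y = mha(x)  # shape (2, 16, 512)
```

In a GQA or MQA variant of this sketch, `k_proj` and `v_proj` would project to fewer K/V heads (one per group, or a single shared head), shrinking the cached K and V tensors accordingly.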

Mentioned in 1 video