Multi-head attention

Concept

A component of Transformer models in which attention is computed in parallel across several heads, each with its own query, key, and value projections. Mentioned in contrast to more efficient attention schemes like grouped-query attention (GQA) and multi-query attention (MQA), which share key/value heads across query heads to reduce KV cache size.
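Below is a minimal sketch of standard multi-head attention in PyTorch, assuming an embedding dimension `d_model` split evenly across `n_heads`; the class and parameter names are illustrative, not from the video. Because every head keeps its own key/value projection, the KV cache scales with the full head count, which is the cost GQA and MQA reduce by sharing K/V heads.

```python
# Minimal multi-head attention sketch (illustrative, not from the source).
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must divide evenly across heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Separate Q, K, V projections; every head has its own K/V slice,
        # so the KV cache grows with n_heads (unlike GQA/MQA).
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        # Project, then split into heads: (batch, n_heads, seq_len, d_head)
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        q = split(self.q_proj(x))
        k = split(self.k_proj(x))
        v = split(self.v_proj(x))

        # Scaled dot-product attention, computed independently per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = scores.softmax(dim=-1)
        out = weights @ v

        # Merge heads back into (batch, seq_len, d_model)
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.out_proj(out)

# Usage: 8 heads over a 512-dim model
mha = MultiHeadAttention(d_model=512, n_heads=8)
x = torch.randn(2, 16, 512)
y = mha(x)  # shape (2, 16, 512)
```

In a GQA or MQA variant of this sketch, `k_proj` and `v_proj` would project to fewer K/V heads (one per group, or a single shared head), shrinking the cached K and V tensors accordingly.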

Mentioned in 1 video