MLA (Multi-latent attention)
Concept
An algorithm from DeepSeek that turns keys and values into a single latent vector, expanded during inference. It offers a way to reduce KV cache size while maintaining richness.
Mentioned in 1 video
An algorithm from DeepSeek that turns keys and values into a single latent vector, expanded during inference. It offers a way to reduce KV cache size while maintaining richness.