MMDiT head

Concept · Mentioned in 1 video

A multimodal diffusion transformer (MMDiT) head that improves feature mixing between vision and action features in vision-language-action (VLA) models, leading to significant performance boosts.
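The entry does not describe the head's internals, so the following is only a minimal sketch of how this kind of feature mixing might look. It assumes the MMDiT-style design in which each modality keeps its own projection weights while attention runs over the concatenated token sequence. All names (`joint_attention`, `params`, the `"vision"` and `"action"` keys) are hypothetical, the weights are random stand-ins for learned parameters, and normalization, multiple heads, and timestep conditioning are left out for brevity.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Small random matrix; a stand-in for learned weights or input features."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def joint_attention(vision, action, params):
    """Mix vision and action tokens with one shared attention pass.

    Each modality is projected with its own Q/K/V weights (assumption),
    then the projected tokens are concatenated so that every token can
    attend to every other token regardless of modality.
    """
    # Modality-specific projections, concatenated along the token axis.
    q = matmul(vision, params["vision"]["q"]) + matmul(action, params["action"]["q"])
    k = matmul(vision, params["vision"]["k"]) + matmul(action, params["action"]["k"])
    v = matmul(vision, params["vision"]["v"]) + matmul(action, params["action"]["v"])

    dim = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    # Scaled softmax over the joint (vision + action) sequence.
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    mixed = matmul(weights, v)

    # Split the mixed sequence back into its two modality streams.
    n_vision = len(vision)
    return mixed[:n_vision], mixed[n_vision:], weights


rng = random.Random(0)
dim = 4
params = {
    modality: {name: rand_matrix(dim, dim, rng) for name in "qkv"}
    for modality in ("vision", "action")
}
vision_tokens = rand_matrix(3, dim, rng)  # 3 vision tokens (illustrative sizes)
action_tokens = rand_matrix(2, dim, rng)  # 2 action tokens

vision_out, action_out, weights = joint_attention(vision_tokens, action_tokens, params)
print(len(vision_out), len(action_out), len(weights[0]))
```

In this sketch each attention row spans all five tokens, so every action token receives a weighted contribution from every vision token and vice versa. That cross-modality attention is the "feature mixing" the description refers to, under the assumptions stated above.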