MMDiT head
Concept · Mentioned in 1 video
A multimodal diffusion transformer (MMDiT) head that improves feature mixing between vision and action features in vision-language-action (VLA) models, reportedly leading to a significant performance boost.
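To make "feature mixing" concrete, here is a minimal sketch of the joint-attention idea usually associated with MMDiT-style blocks: each modality keeps its own query/key/value projections, and a single attention runs over the concatenated vision and action tokens, so every action token can attend to every vision token and vice versa. This is an illustration under stated assumptions, not the head described in the video. The names `joint_attention`, `rand_matrix`, and `params`, the token counts, and the feature size are all hypothetical, and a real block would also include output projections, MLPs, normalization, and timestep conditioning.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Small random matrix, standing in for learned weights or features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def joint_attention(vision, action, params):
    """Single-head joint attention over vision and action tokens.

    Each modality is projected with its OWN weights, then the token
    sequences are concatenated so attention mixes both streams.
    """
    # Per-modality projections; list "+" concatenates along the token axis.
    q = matmul(vision, params["vision"]["q"]) + matmul(action, params["action"]["q"])
    k = matmul(vision, params["vision"]["k"]) + matmul(action, params["action"]["k"])
    v = matmul(vision, params["vision"]["v"]) + matmul(action, params["action"]["v"])

    # Scaled dot-product attention over the joint sequence.
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = [
        softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]
    mixed = matmul(weights, v)

    # Split the mixed tokens back into the two streams.
    n_vis = len(vision)
    return mixed[:n_vis], mixed[n_vis:], weights


# Hypothetical toy sizes: 3 vision tokens, 2 action tokens, feature size 4.
rng = random.Random(0)
dim = 4
params = {
    name: {p: rand_matrix(dim, dim, rng) for p in ("q", "k", "v")}
    for name in ("vision", "action")
}
vision = rand_matrix(3, dim, rng)
action = rand_matrix(2, dim, rng)

vis_out, act_out, weights = joint_attention(vision, action, params)
print(len(vis_out), len(act_out), len(weights[0]))
```

Each row of `weights` spans all five tokens, which is the point of the design: the action stream is updated from vision features in the same attention step rather than through a separate cross-attention layer.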