A Vision-Language-Action (VLA) model used as a baseline and compared against the Size Zero model.
Stanford Online