A dataset used for fine-tuning the fast tokenizer and pre-training the vision-language model.
Stanford Online