Publications

Distilling vision-language models on millions of videos
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this …