Video Captioning on MSR-VTT

1 paper with code • 0 benchmarks • 0 datasets

Video captioning on MSR-VTT is the task of generating a natural-language description of the content of a short video clip from the MSR-VTT dataset, in which each clip is paired with multiple human-written reference captions. Systems are typically scored against these references with standard captioning metrics such as BLEU-4, METEOR, ROUGE-L, and CIDEr.

Most implemented papers

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

txh-mercury/cosa 15 Jun 2023

Due to the limited scale and quality of video-text training corpora, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations.
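The excerpt above only states the motivation; the paper title suggests the core idea is to concatenate several independent image-text samples into one pseudo video-paragraph sample, so that abundant image-text data can supply temporally ordered multi-frame inputs. The sketch below illustrates that general idea only; it is not the authors' implementation, and all names (`ImageTextPair`, `ConcatenatedSample`, `make_concatenated_sample`, `group_size`) are hypothetical.

```python
# Minimal, hypothetical sketch of concatenated-sample construction:
# several image-text pairs are grouped, their images treated as pseudo
# "frames" and their captions joined into a pseudo paragraph.
# This is an assumption about the general technique, not COSA's actual API.
import random
from dataclasses import dataclass
from typing import List


@dataclass
class ImageTextPair:
    image_id: str   # identifier of a single image
    caption: str    # its paired text description


@dataclass
class ConcatenatedSample:
    image_ids: List[str]   # ordered pseudo frames built from independent images
    paragraph: str         # captions joined into a paragraph-level description


def make_concatenated_sample(pairs: List[ImageTextPair],
                             group_size: int = 4,
                             seed: int = 0) -> ConcatenatedSample:
    """Randomly pick `group_size` image-text pairs and concatenate them
    into one pseudo video-text training sample."""
    rng = random.Random(seed)
    chosen = rng.sample(pairs, k=group_size)
    return ConcatenatedSample(
        image_ids=[p.image_id for p in chosen],
        paragraph=" ".join(p.caption for p in chosen),
    )


if __name__ == "__main__":
    corpus = [ImageTextPair(f"img_{i}", f"caption of image {i}") for i in range(10)]
    sample = make_concatenated_sample(corpus, group_size=3)
    print(sample.image_ids)
    print(sample.paragraph)
```

Under this reading, the appeal of the design is that paragraph-level pseudo samples expose the model to longer, multi-segment inputs without requiring a large curated video-text corpus; the actual sampling and pretraining objectives should be taken from the paper and the txh-mercury/cosa repository.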