Video Captioning on MSR-VTT
1 paper with code • 0 benchmarks • 0 datasets
Benchmarks
These leaderboards are used to track progress in Video Captioning on MSR-VTT
No evaluation results have been reported yet.
Most implemented papers
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Due to the limited scale and quality of video-text training corpora, most vision-language foundation models are pretrained on image-text datasets and primarily model visual semantic representations, while disregarding temporal semantic representations and their correlations.
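As the paper's title suggests, one way to repurpose an image-text corpus for video-style pretraining is to concatenate several independent image-text pairs into a single pseudo video-text sample. The sketch below illustrates that idea in a minimal form; the function name and the shapes involved are illustrative assumptions, not COSA's actual implementation.

```python
import numpy as np

def concatenate_samples(images, captions):
    """Stack independent image-text pairs into one pseudo video-text sample.

    Hypothetical sketch of the concatenated-sample idea: each image is
    treated as one "frame" of a clip, and the per-image captions are
    joined into a single clip-level caption, so an abundant image-text
    corpus can stand in for scarce video-text data during pretraining.
    """
    clip = np.stack(images)        # (T, H, W, 3): T images become T frames
    caption = " ".join(captions)   # frame-level captions -> one clip caption
    return clip, caption

# Toy usage: three 4x4 RGB "frames" with simple captions.
frames = [np.zeros((4, 4, 3), dtype=np.uint8) for _ in range(3)]
texts = ["a dog runs", "the dog jumps", "the dog sits"]
clip, cap = concatenate_samples(frames, texts)
print(clip.shape)  # (3, 4, 4, 3)
print(cap)         # a dog runs the dog jumps the dog sits
```

A model pretrained on such concatenated samples sees multi-frame inputs with temporally ordered text, which is the gap the abstract identifies in purely image-text pretraining.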