Self-Chained Image-Language Model for Video Localization and Question Answering

Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Humans often find a video moment to focus on and rewind to that moment to answer questions; however, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. The SeViLA framework consists of two modules, Localizer and Answerer, both parameter-efficiently fine-tuned from BLIP-2. We propose two ways of chaining these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. Our SeViLA framework outperforms several strong baselines on 5 challenging video QA and event prediction benchmarks, and achieves state-of-the-art results in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We also analyze the impact of the Localizer, compare it with other temporal localization models, study the Localizer's pre-training and self-refinement, and examine the effect of varying the number of keyframes.
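To make the two chaining modes concrete, the following is a minimal Python sketch of the logic described above. It is illustrative only, not the authors' implementation: the function names (forward_chain, reverse_chain_pseudo_labels) and the localizer_score / answerer callables are hypothetical stand-ins for the two parameter-efficiently tuned BLIP-2 modules.

```python
# Minimal sketch of SeViLA's two chaining modes (illustrative only; not the authors' code).
# `localizer_score` and `answerer` are hypothetical stand-ins for the Localizer and
# Answerer modules, both parameter-efficiently fine-tuned from BLIP-2 in the paper.

def forward_chain(frames, question, options, localizer_score, answerer, k=4):
    """Forward chain: the Localizer scores each uniformly sampled frame against the
    question, the top-k keyframes are kept, and the Answerer predicts from them."""
    scores = [localizer_score(frame, question) for frame in frames]
    top_idx = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    keyframes = [frames[i] for i in sorted(top_idx)]  # keep temporal order
    return answerer(keyframes, question, options)


def reverse_chain_pseudo_labels(frames, question, options, gt_answer, answerer):
    """Reverse chain: a frame becomes a keyframe pseudo-label if the Answerer, given
    only that frame, already predicts the ground-truth answer. These pseudo-labels
    refine the Localizer without moment-localization annotations."""
    return [i for i, frame in enumerate(frames)
            if answerer([frame], question, options) == gt_answer]


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; real modules would wrap BLIP-2.
    frames = [f"frame_{i:02d}" for i in range(32)]      # uniformly sampled frames
    question, options, gt = "What does the person do next?", ["A", "B", "C"], "A"
    localizer_score = lambda frame, q: (hash(frame + q) % 100) / 100.0  # fake relevance
    answerer = lambda frs, q, opts: opts[0]                             # fake QA head
    print(forward_chain(frames, question, options, localizer_score, answerer, k=4))
    print(reverse_chain_pseudo_labels(frames, question, options, gt, answerer))
```

In the actual framework, both callables would correspond to the same BLIP-2 backbone with separately tuned, parameter-efficient modules, as described in the abstract.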

NeurIPS 2023 (PDF | Abstract)
Benchmark results

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Video Question Answer | EgoSchema (fullset) | SeViLA (4B) | Accuracy | 22.7 | #13 |
| Zero-Shot Video Question Answer | EgoSchema (subset) | SeViLA (4B) | Accuracy | 25.7 | #7 |
| Zero-Shot Video Question Answer | IntentQA | SeViLA (4B) | Accuracy | 60.9 | #4 |
| Zero-Shot Video Question Answer | NExT-QA | SeViLA (4B) | Accuracy | 63.6 | #8 |
| Video Question Answering | NExT-QA | SeViLA | Accuracy | 73.8 | #6 |
| Zero-Shot Video Question Answer | STAR Benchmark | SeViLA | Accuracy | 42.2 | #4 |
| Video Question Answering | STAR Benchmark | SeViLA | Average Accuracy | 64.9 | #3 |
| Zero-Shot Video Question Answer | TVQA | SeViLA (no speech) | Accuracy | 38.2 | #6 |
