Look Deeper See Richer: Depth-aware Image Paragraph Captioning
With the widespread availability of sentence-level image captioning, how to automatically generate image paragraphs is not yet well explored. Describing an image with a full paragraph involves organising sentences orderly, coherently and diversely, inevitably leading to higher complexity than a single sentence. Existing image paragraph captioning methods generate a series of sentences to represent the objects and regions of interest, where the descriptions are essentially produced by feeding the image fragments containing objects and regions into conventional single-sentence image captioning models. With this strategy, it is difficult to generate descriptions that preserve the stereoscopic hierarchy of the scene and avoid overlapping objects. In this paper, we propose a Depth-aware Attention Model (\textit{DAM}) to generate paragraph captions for images. The depths of image areas are first estimated in order to discriminate objects across a range of spatial locations, which further guides the linguistic decoder to reveal spatial relationships among objects. This model composes the paragraph in a logical and coherent manner. By incorporating the attention mechanism, the learned model swiftly shifts the sentence focus during paragraph generation, whilst avoiding verbose descriptions of the same object. Extensive quantitative experiments and a user study have been conducted on the Visual Genome dataset, which demonstrate the effectiveness and interpretability of the proposed model.