Published on Tue Jul 21 2020

Fine-Grained Image Captioning with Global-Local Discriminative Objective

Jie Wu, Tianshui Chen, Hefeng Wu, Zhi Yang, Guangchun Luo, Liang Lin

Existing methods tend to yield overly general captions that consist of some of the most frequent words/phrases. We propose a novel global-local discriminative objective that is formulated on top of a reference model. We evaluate the proposed method on the widely used MS-COCO dataset.

Abstract

Significant progress has been made in recent years in image captioning, an active topic in the fields of vision and language. However, existing methods tend to yield overly general captions that consist of some of the most frequent words/phrases, resulting in inaccurate and indistinguishable descriptions (see Figure 1). This is primarily due to (i) the conservative characteristic of traditional training objectives, which drives the model to generate correct but hardly discriminative captions for similar images, and (ii) the uneven word distribution of the ground-truth captions, which encourages generating highly frequent words/phrases while suppressing the less frequent but more concrete ones. In this work, we propose a novel global-local discriminative objective, formulated on top of a reference model, to facilitate generating fine-grained descriptive captions. Specifically, from a global perspective, we design a novel global discriminative constraint that pulls the generated sentence toward better discerning the corresponding image from all others in the entire dataset. From a local perspective, a local discriminative constraint is proposed to increase the attention paid to the less frequent but more concrete words/phrases, thus facilitating the generation of captions that better describe the visual details of the given images. We evaluate the proposed method on the widely used MS-COCO dataset, where it outperforms the baseline methods by a sizable margin and achieves competitive performance against existing leading approaches. We also conduct self-retrieval experiments to demonstrate the discriminability of the proposed method.
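The two constraints described above lend themselves to a compact sketch. Below is a minimal PyTorch rendition, assuming that a batch of paired image and caption embeddings stands in for "all others in the entire dataset" and that the local constraint is realized as inverse-frequency weighting of the token-level loss. The function names, tensor shapes, and the `word_freq` table are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of a global-local discriminative objective (assumed
# formulation; the paper's exact losses and reference model are not
# reproduced here).
import torch
import torch.nn.functional as F

def global_discriminative_loss(img_feats, cap_feats, temperature=0.1):
    """Contrastive retrieval loss: each caption embedding should score
    highest against its own image among all images in the batch (a
    batch-level proxy for the entire dataset)."""
    img = F.normalize(img_feats, dim=-1)    # (B, D)
    cap = F.normalize(cap_feats, dim=-1)    # (B, D)
    logits = cap @ img.t() / temperature    # (B, B) caption-to-image scores
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

def local_discriminative_weights(token_ids, word_freq, alpha=1.0):
    """Per-token weights that up-weight less frequent (more concrete)
    words. `word_freq` is an assumed 1-D tensor of corpus frequencies
    indexed by vocabulary id."""
    freq = word_freq[token_ids].clamp(min=1.0)   # (B, T)
    return (1.0 / freq) ** alpha                 # rarer word -> larger weight

def weighted_caption_loss(logits, token_ids, word_freq):
    """Token-level cross-entropy re-weighted so that rare, descriptive
    words contribute more than highly frequent ones."""
    B, T, V = logits.shape
    ce = F.cross_entropy(logits.reshape(-1, V), token_ids.reshape(-1),
                         reduction="none").reshape(B, T)
    w = local_discriminative_weights(token_ids, word_freq)
    return (w * ce).sum() / w.sum()
```

In training, the weighted caption loss would replace the standard cross-entropy term, and the global contrastive term would be added with a balancing coefficient; how the paper combines the two with its reference model is not specified in this abstract.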

Related Papers

Sun Jun 21 2020
Computer Vision
Improving Image Captioning with Better Use of Captions
Image captioning is a multimodal problem that has drawn extensive attention. We present a novel image captioning architecture that better explores the semantics available in captions and leverages them to enhance both image representation and caption generation.
Tue Nov 27 2018
Computer Vision
Unsupervised Image Captioning
The paper is the first attempt to train an image captioning model in an unsupervised manner. Instead of relying on manually labeled image-sentence pairs, our proposed model merely requires an image set, a sentence corpus and an existing visual concept detector.
Fri Oct 06 2017
Computer Vision
Contrastive Learning for Image Captioning
Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. The distinctiveness of natural descriptions is often overlooked in previous work. In this work, we propose a new learning method for image captioning.
Fri Feb 28 2020
Machine Learning
Exploring and Distilling Cross-Modal Information for Image Captioning
The Global-and-Local Information approach explores and distills the source of information in vision and language. The Transformer-based model achieves a CIDEr score of 129.3 in offline COCO evaluation. It provides the aspect vector, a spatial and relational representation of images.
Wed Jul 14 2021
Computer Vision
From Show to Tell: A Survey on Image Captioning
Connecting vision and language plays an essential role in generative intelligence. In the last few years, a large research effort has been devoted to image captioning. This work aims to provide a comprehensive overview and categorization of image captioning approaches.
Thu May 25 2017
Computer Vision
Deep image representations using caption generators
Deep learning exploits large volumes of labeled data to learn powerful models. Pre-trained CNNs for image recognition are provided with only limited information about the image during training: the label alone.