Published on Thu Nov 17 2016

SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning

Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, Tat-Seng Chua

Abstract

Visual attention has been successfully applied in structural prediction tasks such as visual captioning and question answering. Existing visual attention models are generally spatial, i.e., the attention is modeled as spatial probabilities that re-weight the last conv-layer feature map of a CNN encoding an input image. However, we argue that such spatial attention does not necessarily conform to the attention mechanism --- a dynamic feature extractor that combines contextual fixations over time, as CNN features are naturally spatial, channel-wise and multi-layer. In this paper, we introduce a novel convolutional neural network dubbed SCA-CNN that incorporates Spatial and Channel-wise Attentions in a CNN. In the task of image captioning, SCA-CNN dynamically modulates the sentence generation context in multi-layer feature maps, encoding where (i.e., attentive spatial locations at multiple layers) and what (i.e., attentive channels) the visual attention is. We evaluate the proposed SCA-CNN architecture on three benchmark image captioning datasets: Flickr8K, Flickr30K, and MSCOCO. It is consistently observed that SCA-CNN significantly outperforms state-of-the-art visual attention-based image captioning methods.
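The abstract's core idea can be illustrated with a minimal sketch: channel-wise attention re-weights the "what" (feature channels) and spatial attention re-weights the "where" (locations) of a conv feature map, conditioned on the decoder context. This is a hypothetical simplification, not the paper's exact formulation; the projection matrices, hidden-state size, and single-layer setting are all assumptions for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a flat vector.
    e = np.exp(x - x.max())
    return e / e.sum()

# Assumed setup: one CNN feature map V of shape (C, H, W) and a
# decoder hidden state h (size chosen arbitrarily for this sketch).
rng = np.random.default_rng(0)
C, H, W = 4, 3, 3
V = rng.standard_normal((C, H, W))
h = rng.standard_normal(8)

# Channel-wise attention ("what"): score each channel against the context.
W_c = rng.standard_normal((C, 8))      # hypothetical projection
beta = softmax(W_c @ h)                # (C,) channel weights, sum to 1
V_chan = V * beta[:, None, None]       # re-weight channels

# Spatial attention ("where"): score each location of the re-weighted map.
W_s = rng.standard_normal((H * W, 8))  # hypothetical projection
alpha = softmax(W_s @ h).reshape(H, W) # (H, W) spatial weights, sum to 1
V_att = V_chan * alpha[None, :, :]     # attention-modulated feature map

print(V_att.shape)  # (4, 3, 3)
```

In SCA-CNN this modulation is applied at multiple conv layers rather than only the last one, which is what lets the model attend over multi-layer feature maps.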

Thu Dec 15 2016
Computer Vision
Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering
Image captioning has been remarkably advanced in recent years. Most existing paradigms may suffer from deficiency of invariance to images with different scaling, rotation, etc. We propose a novel image captioning architecture, termed Recurrent Image Captioner.
Mon Jun 26 2017
Computer Vision
Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention
Image captioning has been recently gaining a lot of attention thanks to deep captioning architectures. At the same time, a significant research effort has been dedicated to the development of saliency prediction models.
Wed Mar 06 2019
Computer Vision
Human Attention in Image Captioning: Dataset and Analysis
Human attention behaviour differs in free-viewing and image description tasks. There is a strong relationship between described objects and attended objects. The soft-attention mechanism differs from human attention, both spatially and temporally.
Mon Nov 02 2020
Computer Vision
Dual Attention on Pyramid Feature Maps for Image Captioning
Generating natural sentences from images is a fundamental learning task for visual-semantic understanding in multimedia. The proposed pyramid attention and dual attention methods are highly modular, which can be inserted into various image captioning modules.
Tue Dec 06 2016
Artificial Intelligence
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
Attention-based neural encoder-decoder frameworks have been widely adopted. Most methods force visual attention to be active for every generated word, yet the decoder likely requires little to no visual information from the image to predict non-visual words such as "the" and "of".
Wed May 23 2018
Computer Vision
CNN+CNN: Convolutional Decoders for Image Captioning
Image captioning is a challenging task that combines the field of computer vision and natural language processing. A variety of approaches have been proposed to achieve the goal of automatically describing an image. We propose a framework that only employs convolutional neural networks (CNNs) to generate captions.