Published on Tue Mar 31 2020

Spatio-Temporal Graph for Video Captioning with Knowledge Distillation

Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, Juan Carlos Niebles

Abstract

Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions using either scene-level or object-level information but without explicitly modeling object interactions. Thus, they often fail to make visually grounded predictions, and are sensitive to spurious correlations. In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time. Our model builds interpretable links and is able to provide explicit visual grounding. To avoid unstable performance caused by the variable number of objects, we further propose an object-aware knowledge distillation mechanism, in which local object information is used to regularize global scene features. We demonstrate the efficacy of our approach through extensive experiments on two benchmarks, showing our approach yields competitive performance with interpretable predictions.
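As a rough picture of the two ideas in the abstract, a spatio-temporal graph over detected objects and an object-aware distillation term that regularizes global scene features, here is a minimal PyTorch sketch. It is an illustration under assumptions, not the authors' implementation: the dense all-pairs attention graph, the mean pooling, the MSE loss, and names such as SpatioTemporalGraph and distillation_loss are all hypothetical choices made for the example.

```python
# Minimal sketch (not the authors' code) of the two ideas in the abstract:
# (1) a graph over object features across space and time, and
# (2) an object-aware knowledge-distillation term that uses pooled local
#     object information to regularize global scene features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalGraph(nn.Module):
    """Hypothetical dense space-time graph: every object in every frame
    attends to every other object. The paper's separate spatial and
    temporal graphs would be a structured refinement of this."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # query projection for graph affinities
        self.k = nn.Linear(dim, dim)  # key projection

    def forward(self, obj):  # obj: (T, N, D) = frames x objects x feature dim
        T, N, D = obj.shape
        x = obj.reshape(T * N, D)                   # all objects as graph nodes
        att = self.q(x) @ self.k(x).t() / D ** 0.5  # node-to-node affinities
        adj = att.softmax(dim=-1)                   # soft adjacency matrix
        return (adj @ x).reshape(T, N, D)           # one round of message passing

def distillation_loss(scene_feat, obj_feat):
    """Object-aware distillation (one plausible form): pull per-frame scene
    features toward a pooled summary of the graph-refined object features."""
    pooled = obj_feat.mean(dim=1)                   # (T, D) object summary
    # Detach the teacher so only the scene branch is regularized.
    return F.mse_loss(scene_feat, pooled.detach())

# Smoke test with random features.
T, N, D = 8, 5, 256
graph = SpatioTemporalGraph(D)
objects = torch.randn(T, N, D)                      # detected-object features
scene = torch.randn(T, D)                           # global scene features
loss = distillation_loss(scene, graph(objects))
print(loss.item())
```

Note the `detach()` on the pooled object summary: gradients then flow only into the scene features, so local object information regularizes the global scene representation rather than the other way around, which matches the direction of regularization the abstract describes. Pooling over objects also sidesteps the variable object count that motivates the distillation mechanism.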

Sat Aug 14 2021
Computer Vision
Cross-Modal Graph with Meta Concepts for Video Captioning
Video captioning aims to interpret complex visual content as text descriptions. Prevailing methods adopt off-the-shelf object detection networks. We propose a Cross-Modal Graph (CMG) with meta concepts for video captioning.
Tue Jun 04 2019
Computer Vision
Relational Reasoning using Prior Knowledge for Visual Captioning
Most existing methods resort to first detecting objects and then generating textual descriptions. We exploit prior human commonsense knowledge to reason about relationships between objects without any pre-trained detectors. The prior knowledge (e.g., in the form of a knowledge graph) provides commonsense semantic correlations and constraints.
Thu Nov 16 2017
Computer Vision
Grounded Objects and Interactions for Video Captioning
We address the problem of video captioning by grounding language generation on object interactions in the video. We discuss the challenges and benefits of such an approach. We demonstrate state-of-the-art results on the ActivityNet Captions dataset using our model.
Sun Aug 08 2021
Computer Vision
Discriminative Latent Semantic Graph for Video Captioning
Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video. Existing generative models such as encoder-decoder frameworks cannot explicitly explore object-level interactions and frame-level information from complex spatio-temporal data.
Sun Mar 08 2020
Computer Vision
OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement
Traditional video captioning produces a holistic description of the video, so detailed descriptions of specific objects may not be available. Without associating moving trajectories, these image-based, data-driven methods cannot understand activities from spatio-temporal transitions.
Mon Apr 23 2018
Computer Vision
Jointly Localizing and Describing Events for Dense Video Captioning
Automatically describing a video with natural language is regarded as a fundamental challenge in computer vision. A valid question is how to temporally localize and then describe events, which is known as "dense video captioning."