Published on Sun Jun 27 2021

Building a Video-and-Language Dataset with Human Actions for Multimodal Logical Inference

Riko Suzuki, Hitomi Yanaka, Koji Mineshima, Daisuke Bekki
Abstract

This paper introduces a new video-and-language dataset with human actions for multimodal logical inference, focusing on intentional and aspectual expressions that describe dynamic human actions. The dataset consists of 200 videos, 5,554 action labels, and 1,942 action triplets of the form ⟨subject, predicate, object⟩ that can be translated into logical semantic representations. The dataset is expected to be useful for evaluating systems that perform multimodal inference between videos and semantically complex sentences, including those with negation and quantification.
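To make the triplet representation concrete, the following is a minimal sketch assuming an event-semantics style encoding; the triplet_to_logic helper and the example triplet are illustrative assumptions, not the dataset's actual translation procedure.

# Minimal illustrative sketch (not the authors' implementation): rendering an
# action triplet <subject, predicate, object> as a first-order-logic style
# semantic representation with an event variable. The helper name
# triplet_to_logic and the example triplet below are hypothetical.
def triplet_to_logic(subject: str, predicate: str, obj: str) -> str:
    """Render a <subject, predicate, object> triplet as a logical form."""
    return f"exists e. ({predicate}(e) & subj(e, {subject}) & obj(e, {obj}))"

print(triplet_to_logic("person", "throw", "ball"))
# -> exists e. (throw(e) & subj(e, person) & obj(e, ball))

# Negating the action yields the kind of negated hypothesis the dataset
# is meant to test inference systems against:
print("not " + triplet_to_logic("person", "throw", "ball"))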

Wed Mar 25 2020
Artificial Intelligence
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference
Video-and-Language Inference is a new task for joint multimodal understanding of video and text. A model needs to infer whether a hypothesis is entailed or contradicted by a given video clip. A large-scale dataset is introduced for this task, consisting of 95,322 video-hypothesis pairs.
Mon Jul 26 2021
Computer Vision
Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference
Video-and-Language Inference is a recently proposed task for joint video and language understanding. This new task requires a model to infer whether a natural language statement is entailed or contradicted by a given video clip. We study how to address three critical challenges of this task.
Fri Apr 02 2021
Computer Vision
Visual Semantic Role Labeling for Video Understanding
The VidSitu benchmark is a large-scale video understanding resource built from richly annotated 10-second movie clips. Entities are co-referenced across events within a movie clip, and events are connected to each other via event-event relations.
Mon Nov 16 2020
Artificial Intelligence
iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering
iPerceive is a framework capable of understanding the "why" between events in a video. It builds a common-sense knowledge base using contextual cues to infer causal relationships between objects in the video. We demonstrate the effectiveness of our technique using dense video captioning and video question answering tasks.
Fri Jun 25 2021
Computer Vision
iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability
Deep learning models often perform poorly on tasks that require causal reasoning. Causality knowledge is vital to building robust AI systems. We propose iReason, a framework that infers visual-semantic commonsense knowledge using both videos and natural language captions.
Wed Sep 05 2018
Computer Vision
Localizing Moments in Video with Temporal Language
Localizing moments in a longer video via natural language queries is a new task. We propose a new model that explicitly reasons about different temporal segments in a video. We collect the novel TEMPOral reasoning in video and language (TEMPO) dataset.