Published on Sun Nov 10 2019

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Chao Zhang, Zichao Yang, Xiaodong He, Li Deng

Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. This paper provides a comprehensive analysis of recent works on multimodal deep learning. The main focus of this review is the combination of vision and natural language modalities.

Abstract

Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in their input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embedding, which unify multimodal signals into a single vector space and thereby enable cross-modality signal processing. We also review the properties of many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe that this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.
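To make the shared-embedding idea concrete, below is a minimal PyTorch sketch of a joint visual-semantic embedding, in the spirit of the models the review covers rather than any specific one: an image encoder and a text encoder project both modalities into a single vector space, trained with a max-margin ranking loss so that matched image-text pairs score higher under cosine similarity than mismatched ones. The feature dimensions, encoder choices, and margin value are illustrative assumptions, not details from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    # Projects precomputed image features and tokenized sentences into one
    # shared vector space (all sizes below are arbitrary illustrative choices).
    def __init__(self, img_feat_dim=2048, vocab_size=10000, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_feat_dim, embed_dim)
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.txt_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, img_feats, token_ids):
        v = F.normalize(self.img_proj(img_feats), dim=-1)   # (B, D) image side
        _, h = self.txt_rnn(self.word_emb(token_ids))       # h: (1, B, D)
        t = F.normalize(h.squeeze(0), dim=-1)               # (B, D) text side
        return v, t

def ranking_loss(v, t, margin=0.2):
    # Max-margin triplet loss over in-batch negatives: matched pairs sit on
    # the diagonal of the similarity matrix; off-diagonal entries serve as
    # negatives for both retrieval directions.
    sim = v @ t.t()                                         # (B, B) cosine sims
    pos = sim.diag().unsqueeze(1)                           # (B, 1)
    cost_t = (margin + sim - pos).clamp(min=0)              # image -> text
    cost_v = (margin + sim - pos.t()).clamp(min=0)          # text -> image
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_t[mask] = 0
    cost_v[mask] = 0
    return cost_t.sum() + cost_v.sum()

Once such a space is trained, cross-modality processing such as image-sentence retrieval reduces to nearest-neighbor search over the shared embeddings.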

Mon May 24 2021
Computer Vision
Recent Advances and Trends in Multimodal Deep Learning: A Review
The goal of multimodal deep learning is to create models that can process and link information using various modalities. This paper focuses on multiple types of modalities, including image, video, text, audio, body gestures, facial expressions, and physiological signals.
Fri Oct 16 2020
Computer Vision
New Ideas and Trends in Deep Multimodal Content Understanding: A Review
The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text. Unlike classic reviews of deep learning, where VGG, ResNet, and the Inception module are central topics, this paper will examine recent multimodal deep models.
Fri May 26 2017
Machine Learning
Multimodal Machine Learning: A Survey and Taxonomy
Multimodal machine learning aims to build models that can process and relate information from multiple modalities. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
Mon Nov 10 2014
Machine Learning
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
Our pipeline unifies joint image-text embedding models with multimodal neural language models. The encoder allows one to rank images and sentences, while the decoder can generate novel descriptions from scratch. We match the state-of-the-art performance on Flickr8K and Flickr30K.
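The decoder side of such a pipeline can be sketched as a language model conditioned on the joint embedding. The snippet below is illustrative only (the paper itself uses multimodal neural language models; every size and name here is an assumption), showing greedy word-by-word generation from an image's embedding.

import torch
import torch.nn as nn

class MultimodalDecoder(nn.Module):
    # Illustrative conditioned decoder: the hidden state is initialized
    # from a joint image-text embedding of dimension embed_dim.
    def __init__(self, vocab_size=10000, embed_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRUCell(embed_dim, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)

    def generate(self, joint_embedding, bos_id=1, max_len=20):
        # Greedy decoding to a fixed length; joint_embedding is (B, D).
        h = joint_embedding
        token = torch.full((h.size(0),), bos_id, dtype=torch.long,
                           device=h.device)
        caption = []
        for _ in range(max_len):
            h = self.rnn(self.word_emb(token), h)
            token = self.out(h).argmax(dim=-1)
            caption.append(token)
        return torch.stack(caption, dim=1)       # (B, max_len) token ids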
Sat Sep 02 2017
Artificial Intelligence
XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification
We propose two deep learning architectures with multimodal cross-connections that allow for dataflow between several feature extractors. Our models derive more interpretable features and achieve better performance than models which do not exchange representations. We provide the research community with Digits, a new dataset composed of three data types extracted from videos of people saying digits.
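As a rough illustration of such cross-connections (a sketch under assumed layer widths and a simple additive exchange, not the XFlow architecture itself), each modality's feature extractor can receive a projected copy of the other stream's intermediate activations before its next layer:

import torch
import torch.nn as nn

class CrossModalNet(nn.Module):
    # Two feature extractors with cross-connections that let activations
    # flow between the streams (all sizes are illustrative assumptions).
    def __init__(self, audio_dim=128, visual_dim=256, hidden=64, n_classes=10):
        super().__init__()
        self.audio_l1 = nn.Linear(audio_dim, hidden)
        self.visual_l1 = nn.Linear(visual_dim, hidden)
        self.a2v = nn.Linear(hidden, hidden)     # audio -> visual connection
        self.v2a = nn.Linear(hidden, hidden)     # visual -> audio connection
        self.audio_l2 = nn.Linear(hidden, hidden)
        self.visual_l2 = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio, visual):
        a = torch.relu(self.audio_l1(audio))
        v = torch.relu(self.visual_l1(visual))
        # Exchange projected representations so each stream sees the other
        # modality's features before its second layer.
        a_cross, v_cross = self.v2a(v), self.a2v(a)
        a = torch.relu(self.audio_l2(a + a_cross))
        v = torch.relu(self.visual_l2(v + v_cross))
        return self.classifier(torch.cat([a, v], dim=-1))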
Tue May 14 2019
Artificial Intelligence
Strong and Simple Baselines for Multimodal Utterance Embeddings
Human language is a rich multimodal signal consisting of spoken words, facial expressions, body gestures, and vocal intonations. Learning representations for these spoken utterances is a complex research problem due to the presence of multiple heterogeneous sources of information.