Deep learning methods have revolutionized speech recognition, image
recognition, and natural language processing since 2010. Each of these tasks
involves a single modality in its input signals. However, many applications
in the artificial intelligence field involve multiple modalities. Therefore, it
is of broad interest to study the more difficult and complex problem of
modeling and learning across multiple modalities. In this paper, we provide a
technical review of available models and learning methods for multimodal
intelligence. The main focus of this review is the combination of vision and
natural language modalities, which has become an important topic in both the
computer vision and natural language processing research communities. This
review provides a comprehensive analysis of recent work on multimodal deep
learning from three perspectives: learning multimodal representations, fusing
multimodal signals at various levels, and multimodal applications. Regarding
multimodal representation learning, we review the key concept of embeddings,
which unify multimodal signals into a single vector space and thereby enable
cross-modality signal processing. We also review the properties of many types
of embeddings that are constructed and learned for general downstream tasks.
Regarding multimodal fusion, this review focuses on special architectures for
the integration of representations of unimodal signals for a particular task.
Regarding applications, selected areas of broad interest in the current
literature are covered, including image-to-text caption generation,
text-to-image generation, and visual question answering. We believe that this
review will facilitate future studies in the emerging field of multimodal
intelligence for related communities.