Published on Tue Sep 12 2017

End-to-End Audiovisual Fusion with LSTMs

Stavros Petridis, Yujiang Wang, Zuwei Li, Maja Pantic

Abstract

Several end-to-end deep learning approaches have been recently presented which simultaneously extract visual features from the input images and perform visual speech classification. However, research on jointly extracting audio and visual features and performing classification is very limited. In this work, we present an end-to-end audiovisual model based on Bidirectional Long Short-Term Memory (BLSTM) networks. To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the pixels and spectrograms and perform classification of speech and nonlinguistic vocalisations. The model consists of multiple identical streams, one for each modality, which extract features directly from mouth regions and spectrograms. The temporal dynamics in each stream/modality are modeled by a BLSTM and the fusion of multiple streams/modalities takes place via another BLSTM. An absolute improvement of 1.9% in the mean F1 of 4 nonlinguistic vocalisations over audio-only classification is reported on the AVIC database. At the same time, the proposed end-to-end audiovisual fusion system improves the state-of-the-art performance on the AVIC database, leading to a 9.7% absolute increase in the mean F1 measure. We also perform audiovisual speech recognition experiments on the OuluVS2 database using different views of the mouth, from frontal to profile. The proposed audiovisual system significantly outperforms the audio-only model for all views when the acoustic noise is high.
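For illustration, a minimal PyTorch sketch of the fusion architecture described above: one stream per modality encodes its input frame by frame, a per-stream BLSTM models the temporal dynamics, and a second BLSTM fuses the concatenated stream outputs before classification. The fully connected encoders, all layer sizes, the shared time resolution across modalities, and classifying from the last fused time step are assumptions made for the sketch, not the paper's exact configuration.

```python
# Minimal sketch of a two-stream BLSTM fusion model (assumed sizes/encoders).
import torch
import torch.nn as nn


class StreamEncoder(nn.Module):
    """One modality stream: frame-wise encoder followed by a BLSTM."""

    def __init__(self, input_dim, enc_dim=256, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, enc_dim), nn.ReLU(),
            nn.Linear(enc_dim, enc_dim), nn.ReLU(),
        )
        # Bidirectional LSTM models the temporal dynamics within the stream.
        self.blstm = nn.LSTM(enc_dim, hidden_dim, batch_first=True,
                             bidirectional=True)

    def forward(self, x):                 # x: (batch, time, input_dim)
        z = self.encoder(x)               # frame-wise features
        out, _ = self.blstm(z)            # (batch, time, 2 * hidden_dim)
        return out


class AVFusionBLSTM(nn.Module):
    """Audiovisual model: one stream per modality, fused by a second BLSTM."""

    def __init__(self, mouth_dim, spec_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.visual_stream = StreamEncoder(mouth_dim, hidden_dim=hidden_dim)
        self.audio_stream = StreamEncoder(spec_dim, hidden_dim=hidden_dim)
        # Fusion BLSTM runs over the concatenated per-stream outputs.
        self.fusion_blstm = nn.LSTM(4 * hidden_dim, hidden_dim,
                                    batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, mouth_frames, spectrogram_frames):
        v = self.visual_stream(mouth_frames)          # (B, T, 2H)
        a = self.audio_stream(spectrogram_frames)     # (B, T, 2H)
        fused, _ = self.fusion_blstm(torch.cat([v, a], dim=-1))
        # One label per utterance, taken from the last fused time step.
        return self.classifier(fused[:, -1, :])


# Dummy usage: 8 utterances, 29 frames, flattened mouth ROIs and
# spectrogram slices of assumed dimensionality.
model = AVFusionBLSTM(mouth_dim=44 * 50, spec_dim=128, num_classes=4)
mouth = torch.randn(8, 29, 44 * 50)
spec = torch.randn(8, 29, 128)
logits = model(mouth, spec)               # shape (8, 4)
```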

Sun Feb 18 2018
Computer Vision
End-to-end Audiovisual Speech Recognition
This is the first audiovisual fusion model which learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition. The model consists of two streams, one for each modality, which extract features from mouth regions and raw waveforms.
Fri Jan 20 2017
Computer Vision
End-To-End Visual Speech Recognition With LSTMs
Traditional visual speech recognition systems consist of two stages, feature extraction and classification. To the best of our knowledge, this is the first model which simultaneously learns to extract features directly from the pixels and also achieves state-of-the-art performance.
Tue Apr 02 2019
Computer Vision
End-to-End Visual Speech Recognition for Small-Scale Datasets
An absolute improvement of 0.6%, 3.9%, 11.4% over the state-of-the-art is reported on the OuluVS2, CUAVE, AVLetters and AVLetters2 databases.
Mon Nov 21 2016
Machine Learning
Robust end-to-end deep audiovisual speech recognition
Multi-modal speech recognition has not yet found widespread use. This paper presents an end-to-end audiovisual speech recognizer.
Thu Jan 22 2015
Machine Learning
Deep Multimodal Learning for Audio-Visual Speech Recognition
We present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). We study an approach where uni-modal deep networks are trained separately and their final hidden layers are fused to obtain a joint feature space.
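A hedged sketch of the hidden-layer fusion idea mentioned in this abstract: two unimodal networks (assumed here to be pre-trained and frozen) produce their final hidden activations, which are concatenated into a joint feature space on which a small classifier is trained. All module names, layer sizes and the 40-class output are illustrative assumptions, not the paper's setup.

```python
# Late fusion of final hidden layers from two (assumed pre-trained) unimodal nets.
import torch
import torch.nn as nn

audio_net = nn.Sequential(nn.Linear(120, 512), nn.ReLU(),
                          nn.Linear(512, 256), nn.ReLU())
video_net = nn.Sequential(nn.Linear(1200, 512), nn.ReLU(),
                          nn.Linear(512, 256), nn.ReLU())

# Freeze the unimodal networks; only the joint classifier would be trained.
for p in list(audio_net.parameters()) + list(video_net.parameters()):
    p.requires_grad = False

joint_classifier = nn.Linear(256 + 256, 40)   # e.g. 40 illustrative targets

audio_feat = torch.randn(32, 120)             # acoustic feature vector per frame
video_feat = torch.randn(32, 1200)            # flattened mouth ROI per frame
joint = torch.cat([audio_net(audio_feat), video_net(video_feat)], dim=-1)
logits = joint_classifier(joint)              # shape (32, 40)
```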
Tue May 12 2020
Computer Vision
Discriminative Multi-modality Speech Recognition
Vision is often used as a complementary modality for audio speech recognition. After combining the visual modality, ASR is upgraded to multi-modality speech recognition (MSR). In the first stage, the target voice is separated from background noises with the help of the corresponding visual information of lip movements.
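A rough sketch, under assumed interfaces, of the two-stage pipeline this abstract describes: stage one estimates a time-frequency mask for the target speaker conditioned on lip-movement features, stage two recognises speech from the enhanced spectrogram. The module structure, feature dimensions and mask-based separation are illustrative choices, not the paper's implementation.

```python
# Two-stage sketch: visual-guided separation, then recognition (assumed shapes).
import torch
import torch.nn as nn


class TargetSpeakerSeparator(nn.Module):
    def __init__(self, freq_bins=257, lip_dim=256, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(freq_bins + lip_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, freq_bins), nn.Sigmoid())

    def forward(self, noisy_spec, lip_feats):          # (B, T, F), (B, T, D)
        h, _ = self.blstm(torch.cat([noisy_spec, lip_feats], dim=-1))
        return self.mask(h) * noisy_spec               # enhanced spectrogram


class Recognizer(nn.Module):
    def __init__(self, freq_bins=257, hidden=256, num_classes=500):
        super().__init__()
        self.blstm = nn.LSTM(freq_bins, hidden, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, spec):
        h, _ = self.blstm(spec)
        return self.out(h[:, -1, :])                   # utterance-level logits


separator, recognizer = TargetSpeakerSeparator(), Recognizer()
noisy = torch.randn(4, 100, 257).abs()                 # magnitude spectrogram
lips = torch.randn(4, 100, 256)                        # lip-movement embeddings
logits = recognizer(separator(noisy, lips))            # shape (4, 500)
```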