Published on Fri Oct 30 2020

Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition

Wei Zhou, Simon Berger, Ralf Schlüter, Hermann Ney

Abstract

To combine the advantages of classical and end-to-end approaches for speech recognition, we present a simple, novel and competitive approach for phoneme-based neural transducer modeling. Different alignment label topologies are compared, and word-end-based phoneme label augmentation is proposed to improve performance. Exploiting the local dependency of phonemes, we adopt a simplified neural network structure and a straightforward integration with the external word-level language model that preserves the consistency of sequence-to-sequence modeling. We also present a simple, stable and efficient training procedure using frame-wise cross-entropy loss. A phonetic context size of one is shown to be sufficient for the best performance. A simplified scheduled sampling approach is applied for further improvement, and different decoding approaches are briefly compared. The overall performance of our best model is comparable to state-of-the-art (SOTA) results for the TED-LIUM Release 2 and Switchboard corpora.
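The training recipe sketched in the abstract (frame-wise cross-entropy over alignment labels, a phonetic context of one, and simplified scheduled sampling that sometimes feeds the model its own previous prediction) can be illustrated with a minimal NumPy toy. All names here (`TinyPhonemeTransducer`, `train_step`, the label indices) are hypothetical illustrations, not the authors' implementation:

```python
import numpy as np

def frame_ce_loss(logits, targets):
    """Frame-wise cross-entropy averaged over the sequence."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return float(-np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-9)))

class TinyPhonemeTransducer:
    """Toy linear model: logits depend on the current acoustic frame and
    a single previous phoneme label (phonetic context of one)."""
    def __init__(self, n_phones, feat_dim, rng):
        self.W = rng.normal(size=(feat_dim, n_phones)) * 0.1  # acoustic weights
        self.E = rng.normal(size=(n_phones, n_phones)) * 0.1  # context embedding
    def __call__(self, feat, prev_label):
        return feat @ self.W + self.E[prev_label]

def train_step(model, feats, labels, ss_prob, rng):
    """One frame-synchronous pass with simplified scheduled sampling:
    with probability ss_prob the context fed to the next frame is the
    model's own argmax prediction instead of the ground-truth label."""
    prev = 0  # hypothetical start/blank context index
    logits_seq = []
    for t in range(len(feats)):
        logits = model(feats[t], prev)
        logits_seq.append(logits)
        if rng.random() < ss_prob:
            prev = int(np.argmax(logits))   # model's own prediction
        else:
            prev = int(labels[t])           # ground-truth alignment label
    return frame_ce_loss(np.stack(logits_seq), labels)

rng = np.random.default_rng(0)
T, feat_dim, n_phones = 20, 8, 5
feats = rng.normal(size=(T, feat_dim))
labels = rng.integers(0, n_phones, size=T)
model = TinyPhonemeTransducer(n_phones, feat_dim, rng)
loss = train_step(model, feats, labels, ss_prob=0.3, rng=rng)
```

The point of the feedback branch is that, at inference time, the model only ever sees its own previous output; scheduled sampling exposes it to that condition during training while the frame-wise cross-entropy keeps the optimization simple and stable.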

Wed Mar 22 2017
Neural Networks
Direct Acoustics-to-Word Models for English Conversational Speech Recognition
CTC word models require orders of magnitude more data than traditional systems to train reliably. Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder.
Fri Dec 08 2017
Neural Networks
Building competitive direct acoustics-to-word models for English conversational speech recognition
Direct acoustics-to-word (A2W) models have received increasing attention compared to conventional sub-word based automatic speech recognition models. A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model.
Wed Jul 29 2015
Machine Learning
EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding
The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs). Building a new ASR system nevertheless remains a challenging task, requiring resources, multiple training stages and significant expertise. This paper presents our Eesen framework, which drastically simplifies the pipeline for building state-of-the-art ASR systems.
Mon Oct 31 2016
Neural Networks
Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition
We model the output vocabulary of about 100,000 words using deep bi-directional LSTM RNNs with CTC loss. We show that the CTC word models work very well as an end-to-end all-neural speech recognition model.
Thu Jun 13 2019
Machine Learning
Telephonetic: Making Neural Language Models Robust to ASR and Semantic Noise
Speech processing systems rely on robust feature extraction to handle phonetic and semantic variations found in natural language. To capture phonetic alterations, we employ a character-level language model trained using probabilistic masking. Words are selected for augmentation according to a hierarchical grammar sampling strategy.
Wed Feb 20 2019
NLP
Phoneme Level Language Models for Sequence Based Low Resource ASR
Building multilingual and crosslingual models helps bring different languages together in a language-universal space. It allows models to share parameters and transfer knowledge across languages, enabling faster and better adaptation to a new language. These approaches are particularly useful for low-resource languages.