Published on Thu Jul 15 2021

CLSRIL-23: Cross Lingual Speech Representations for Indic Languages

Anirudh Gupta, Harveen Singh Chadha, Priyanshi Shah, Neeraj Chimmwal, Ankur Dhuriya, Rishabh Gaur, Vivek Raghavan

CLSRIL-23 is a model trained on 23 Indic languages and almost 10,000 hours of audio data. It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations.

Abstract

We present CLSRIL-23, a self-supervised learning based audio pre-trained model which learns cross-lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations and jointly learning a quantization of the latents shared across all languages. We compare the language-wise loss during pretraining to study the effects of monolingual and multilingual pretraining. Performance on some downstream fine-tuning tasks for speech recognition is also compared, and our experiments show that multilingual pretraining outperforms monolingual training, both in terms of learning speech representations which encode the phonetic similarity of languages and in terms of performance on downstream tasks. A decrease of 5% is observed in WER and 9.5% in CER when a multilingual pretrained model is used for fine-tuning in Hindi. All the code and models are also open sourced. CLSRIL-23 is a model trained on 23 languages and almost 10,000 hours of audio data to facilitate research in speech recognition for Indic languages. We hope that new state-of-the-art systems will be created using the self-supervised approach, especially for low-resource Indic languages.
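To make the contrastive objective concrete, here is a minimal sketch of a wav2vec 2.0-style contrastive loss over masked latent representations: at each masked time step, the model must identify the true quantized latent among sampled distractors. This is an illustrative assumption of how such a loss can be written, not the authors' released implementation; all tensor names, shapes, and the temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, distractors, temperature=0.1):
    # context:     (B, T, D) context-network outputs at masked time steps
    # targets:     (B, T, D) true quantized latents for those steps
    # distractors: (B, T, K, D) quantized latents sampled from other masked steps
    candidates = torch.cat([targets.unsqueeze(2), distractors], dim=2)  # (B, T, K+1, D)
    # Cosine similarity between each context vector and every candidate, scaled
    # by a temperature, gives the logits of a (K+1)-way classification problem.
    logits = F.cosine_similarity(context.unsqueeze(2), candidates, dim=-1) / temperature
    # The true latent always sits at candidate index 0.
    labels = torch.zeros(logits.shape[:2], dtype=torch.long)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

# Toy usage with random tensors: 2 utterances, 5 masked steps, 100 distractors.
B, T, K, D = 2, 5, 100, 256
loss = contrastive_loss(torch.randn(B, T, D), torch.randn(B, T, D),
                        torch.randn(B, T, K, D))
print(loss.item())
```

In the multilingual setting described in the abstract, the quantized codebook producing the targets is shared across all 23 languages, which is what lets the model exploit phonetic similarity between them.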

Wed Jun 24 2020
NLP
Unsupervised Cross-lingual Representation Learning for Speech Recognition
XLSR learns cross-lingual speech representations by pretraining a single model. XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results.
Fri Jul 23 2021
NLP
OLR 2021 Challenge: Datasets, Rules and Baselines
This paper introduces the sixth Oriental Language Recognition (OLR) 2021 Challenge. It intends to improve the performance of language recognition and speech recognition systems within multilingual scenarios. The data profile, four tasks, two baselines, and the evaluation principles are described.
Wed Sep 05 2018
NLP
Pre-training on high-resource speech recognition improves low-resource speech-to-text translation
Pre-training can improve direct speech-to-text translation when the source language is low-resource. The pre-trained encoder accounts for most of the improvement, despite the fact that the shared language is the target language text.
Tue Dec 22 2020
NLP
Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages
wav2vec2.0 has not been examined on real spoken scenarios and languages other than English. We achieve more than 20% relative improvements in six languages compared with previous work. English achieves a gain of 52.4%.
Mon Oct 29 2018
NLP
Cascaded CNN-resBiLSTM-CTC: An End-to-End Acoustic Model For Speech Recognition
Automatic speech recognition (ASR) tasks are resolved by end-to-end deep learning models. We propose a novel deep learning model architecture, namely cascaded CNN-resBiLSTM-CTC. By applying both a simple Fast Fourier Transform (FFT) technique and …
Wed Jun 30 2021
NLP
IMS' Systems for the IWSLT 2021 Low-Resource Speech Translation Task
This paper describes the submission to the IWSLT 2021 Low-Resource Speech Translation Shared Task by IMS team. We utilize state-of-the-art models and several data augmentation, multi-task and transfer learning approaches.
Mon May 24 2021
NLP
Unsupervised Speech Recognition
Wav2vec-U is a method to train speech recognition models without any labeled data. It reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3.
Sat Jun 20 2020
NLP
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER.
Thu Oct 11 2018
NLP
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT is designed to pre-train deep bidirectional representations from unlabeled text. It can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Mon Dec 22 2014
Machine Learning
Adam: A Method for Stochastic Optimization
Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and has little memory requirements. It is well suited for problems that are large in terms of data and parameters.
Mon May 27 2019
Machine Learning
VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019
We describe our submitted system for the ZeroSpeech Challenge 2019. The current challenge theme addresses the difficulty of constructing a speech synthesizer without any text or phonetic labels. We utilize a vector quantized variational autoencoder (VQ-VAE) and a multi-scale codebook-to-spectrogram (Code2Spec) inverter.