Published on Tue Jul 30 2019

MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible

Marcely Zanon Boito, William N. Havard, Mahault Garnerin, Éric Le Ferrand, Laurent Besacier


Abstract

The CMU Wilderness Multilingual Speech Dataset (Black, 2019) is a newly published multilingual speech dataset based on recorded readings of the New Testament. It provides data to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for potentially 700 languages. However, the fact that the source content (the Bible) is the same for all the languages has not been exploited to date. Therefore, this article proposes to add multilingual links between speech segments in different languages, and shares a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). We name this corpus MaSS (Multilingual corpus of Sentence-aligned Spoken utterances). The covered languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) allow research on speech-to-speech alignment as well as on translation for typologically different language pairs. The quality of the final corpus is attested by human evaluation performed on a corpus subset (100 utterances, 8 language pairs). Lastly, we showcase the usefulness of the final product on a bilingual speech retrieval task.

Mon Dec 07 2020
NLP
MLS: A Large-Scale Multilingual Dataset for Speech Research
This paper introduces the Multilingual LibriSpeech (MLS) dataset. The dataset is derived from read audiobooks from LibriVox. It consists of 8 languages, including about 44.5K hours of English and a total of 6K hours for other languages.
Fri Nov 08 2019
NLP
Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates
Europarl-ST is a novel multilingual spoken language translation (SLT) corpus. It contains paired audio-text samples for SLT from and into 6 European languages. The corpus is released under a Creative Commons license.
Tue Feb 02 2021
NLP
The Multilingual TEDx Corpus for Speech Recognition and Translation
The Multilingual TEDx corpus is built to support speech recognition and speech translation research across many non-English source languages. The corpus is a collection of audio recordings from TEDx talks in 8 languages.
Fri Jul 27 2018
NLP
A small Griko-Italian speech translation corpus
The corpus consists of 330 utterances (about 20 minutes of speech) which have been transcribed and translated into Italian. The dataset is available online to encourage replicability and diversity in language documentation experiments.
Mon Jul 20 2020
NLP
CoVoST 2 and Massively Multilingual Speech-to-Text Translation
CoVoST 2 is a large-scale multilingual speech translation corpus. It covers translations from 21 languages into English and from English into 15 languages. This represents the largest open dataset available to date in terms of total volume and language coverage.
Tue Feb 04 2020
NLP
CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English, diversified with over 11,000 speakers and over 60 accents. We describe the dataset creation methodology and provide empirical evidence of the quality of the data.