Published on Sun Jan 26 2020

Multi-task Learning for Speaker Verification and Voice Trigger Detection

Siddharth Sigtia, Erik Marchi, Sachin Kajarekar, Devang Naik, John Bridle

Automatic speech transcription and speaker recognition are usually treated as separate tasks. In this study, we investigate training a single network to perform both tasks jointly. Results demonstrate that the network is able to encode both phonetic and speaker information in its learnt representations.

Abstract

Automatic speech transcription and speaker recognition are usually treated as separate tasks even though they are interdependent. In this study, we investigate training a single network to perform both tasks jointly. We train the network in a supervised multi-task learning setup, where the speech transcription branch of the network is trained to minimise a phonetic connectionist temporal classification (CTC) loss while the speaker recognition branch of the network is trained to label the input sequence with the correct label for the speaker. We present a large-scale empirical study where the model is trained using several thousand hours of labelled training data for each task. We evaluate the speech transcription branch of the network on a voice trigger detection task while the speaker recognition branch is evaluated on a speaker verification task. Results demonstrate that the network is able to encode both phonetic and speaker information in its learnt representations while yielding accuracies at least as good as the baseline models for each task, with the same number of parameters as the independent models.
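To make the multi-task objective concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It computes a phonetic CTC loss with the standard forward algorithm, a sequence-level speaker cross-entropy, and a weighted sum of the two. The function names, the blank index of 0, and the loss weights are assumptions made for illustration; the abstract does not say how the two losses are combined.

```python
import math

NEG_INF = float("-inf")


def log_sum_exp(*values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf inputs."""
    peak = max(values)
    if peak == NEG_INF:
        return NEG_INF
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    target: list of phone indices without blanks.
    """
    # Interleave blanks: [blank, p1, blank, p2, ..., blank]
    extended = [blank]
    for phone in target:
        extended += [phone, blank]
    size = len(extended)

    # alpha[s]: log-probability of all partial alignments ending at extended[s]
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][extended[0]]
    if size > 1:
        alpha[1] = log_probs[0][extended[1]]

    for frame in log_probs[1:]:
        new_alpha = [NEG_INF] * size
        for s in range(size):
            candidates = [alpha[s]]
            if s >= 1:
                candidates.append(alpha[s - 1])
            # A blank may be skipped only between two different phones.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = log_sum_exp(*candidates) + frame[extended[s]]
        alpha = new_alpha

    # Valid alignments end on the last phone or the trailing blank.
    tails = [alpha[-1]] + ([alpha[-2]] if size > 1 else [])
    return -log_sum_exp(*tails)


def speaker_loss(speaker_log_probs, speaker_id):
    """Cross-entropy for one utterance-level speaker label."""
    return -speaker_log_probs[speaker_id]


def multi_task_loss(phone_log_probs, phones, speaker_log_probs, speaker_id,
                    phonetic_weight=1.0, speaker_weight=1.0):
    """Weighted sum of both branch losses (the weights are hypothetical)."""
    return (phonetic_weight * ctc_loss(phone_log_probs, phones)
            + speaker_weight * speaker_loss(speaker_log_probs, speaker_id))


# Toy example: 2 frames, symbols {blank, phone 1}, 4 candidate speakers.
frames = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
speakers = [math.log(0.25)] * 4
total = multi_task_loss(frames, [1], speakers, speaker_id=2)
```

In a real system both sets of log-probabilities would come from two output branches on top of shared layers, so the gradient of the summed loss updates the shared representation with phonetic and speaker information at once.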

Sun Jan 26 2020
Machine Learning
Multi-task Learning for Voice Trigger Detection
Sun Dec 08 2019
NLP
A Multi Purpose and Large Scale Speech Corpus in Persian and English for Speaker and Speech Recognition: the DeepMine Database
DeepMine is a speech database in Persian and English designed to build and evaluate text-dependent, text-prompted, and text-independent speaker verification. It contains more than 1850 speakers and 540 thousand recordings overall.
Sun Sep 27 2015
Machine Learning
End-to-End Text-Dependent Speaker Verification
The proposed approach appears to be very effective for big data applications like ours that require highly accurate, easy-to-maintain systems.
Sun Dec 09 2018
Machine Learning
To Reverse the Gradient or Not: An Empirical Comparison of Adversarial and Multi-task Learning in Speech Recognition
Multi-Task Learning and Adversarial Learning. Transcribed datasets typically contain speaker identity for each instance in the data. We explore the use of additional untranscribed data in a semi-supervised, adversarial learning manner to improve error rates.
Fri Jun 22 2018
NLP
Weakly Supervised Training of Speaker Identification Models
We propose an approach for training speaker identification models in a weakly supervised manner. We report experiments on two different real-world datasets. On the VoxCeleb dataset, the method provides 94.6% accuracy on a closed set speaker identification task.
Mon Apr 02 2018
Artificial Intelligence
Speaker-Invariant Training via Adversarial Learning
We propose a novel adversarial multi-task learning scheme. The scheme, speaker-invariant training (SIT), is jointly optimized to minimize the senone (tied triphone state) classification loss and mini-maximize the speaker classification loss. A canonical DNN acoustic