Published on Fri Feb 08 2019

Speaker diarisation using 2D self-attentive combination of embeddings

Guangzhi Sun, Chao Zhang, Phil Woodland

Speaker diarisation systems often cluster audio segments using speaker embeddings. Since different types of embeddings are often complementary, this paper proposes a generic framework that improves diarisation performance by combining them into a single embedding.

Abstract

Speaker diarisation systems often cluster audio segments using speaker embeddings such as i-vectors and d-vectors. Since different types of embeddings are often complementary, this paper proposes a generic framework to improve performance by combining them into a single embedding, referred to as a c-vector. This combination uses a 2-dimensional (2D) self-attentive structure, which extends the standard self-attentive layer by averaging not only across time but also across different types of embeddings. The two types of 2D self-attentive structure proposed in this paper are the simultaneous combination and the consecutive combination, which adopt a single self-attentive layer and multiple self-attentive layers respectively. The penalty term in the original self-attentive layer, which is jointly minimised with the objective function to encourage diversity among the annotation vectors, is also modified so that the multiple annotation vectors capture not only different local peaks but also the overall trends. Experiments on the AMI meeting corpus show that our modified penalty term reduces the speaker error rate (SER) of two d-vector systems by 6% and 21% relative, and that a further 10% relative SER reduction can be obtained using the c-vector from our best 2D self-attentive structure.
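As a rough illustration of the simultaneous combination, the following PyTorch sketch (our own; class and parameter names such as SelfAttentiveCombination2D, d_att and n_heads are hypothetical, not from the paper) flattens the time-by-embedding-type grid, computes multi-head annotation weights with a standard tanh self-attention, and averages jointly over both axes. The diversity penalty shown is the original ||A^T A - I||_F^2 term from structured self-attention (Lin et al., 2017), which the paper modifies.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttentiveCombination2D(nn.Module):
        # Simultaneous 2D self-attentive combination (sketch, not the
        # authors' released code).
        # Input:  H of shape (batch, T, K, d) -- T frames and K embedding
        # types (e.g. i-vectors and d-vectors projected to a common size d).
        # Output: c-vector of shape (batch, n_heads * d), a weighted
        # average taken jointly over time and embedding type.
        def __init__(self, d, d_att=128, n_heads=4):
            super().__init__()
            self.W1 = nn.Linear(d, d_att, bias=False)        # hidden projection
            self.W2 = nn.Linear(d_att, n_heads, bias=False)  # one annotation vector per head

        def forward(self, H):
            B, T, K, d = H.shape
            flat = H.reshape(B, T * K, d)                    # flatten the 2D grid
            scores = self.W2(torch.tanh(self.W1(flat)))      # (B, T*K, n_heads)
            A = F.softmax(scores, dim=1)                     # weights over time AND type
            c = torch.einsum('bnh,bnd->bhd', A, flat)        # per-head weighted average
            return c.reshape(B, -1), A

    def diversity_penalty(A):
        # Standard ||A^T A - I||_F^2 penalty, jointly minimised with the
        # task loss to encourage diverse annotation vectors; the paper's
        # modified form differs and is given in the paper itself.
        gram = torch.einsum('bnh,bng->bhg', A, A)
        eye = torch.eye(A.shape[-1], device=A.device)
        return ((gram - eye) ** 2).sum(dim=(1, 2)).mean()

Under the same assumptions, the consecutive variant would stack such self-attentive layers rather than using a single one, for example one applied across time within each embedding type followed by one across embedding types.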

Thu Oct 22 2020
Machine Learning
Combination of Deep Speaker Embeddings for Diarisation
This paper proposes a method to extract better-performing speaker embeddings. It uses multiple sets of complementary d-vectors derived from different NN components. A neural-based single-pass speaker diarisation pipeline is also proposed. Experiments and detailed analyses are conducted on challenging datasets.
Fri Feb 12 2021
Machine Learning
Content-Aware Speaker Embeddings for Speaker Diarisation
The content-aware speaker embeddings (CASE) approach is proposed, which extends the input of the speaker classifier to include phone, character, and word embeddings. Experimental results showed that CASE achieved a 17.8% relative speaker error rate reduction over conventional methods.
Wed Apr 07 2021
Machine Learning
Adapting Speaker Embeddings for Speaker Diarisation
The quality of speaker embeddings is paramount to the performance of speaker diarisation systems. We propose three techniques that can be used to better adapt the speaker embeddings. All three techniques contribute positively to the performance of the system.
Mon Jan 25 2021
Machine Learning
Domain-Dependent Speaker Diarization for the Third DIHARD Challenge
This report presents the system developed by the ABSP Laboratory team for the third DIHARD speech diarization challenge. We explore speaker embeddings for the acoustic domain identification (ADI) task. The performance substantially improved over that of the baseline.
Fri Sep 13 2019
NLP
Probing the Information Encoded in X-vectors
Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks.
Fri Sep 13 2019
NLP
End-to-End Neural Speaker Diarization with Self-attention
Speaker diarization has been mainly developed based on the clustering of speaker embeddings. In this study, we enhance EEND by introducing self-attention blocks instead of BLSTM blocks. We evaluated our proposed method on simulated mixtures, real telephone calls, and real conversations.