Published on Mon Oct 14 2019

The Theory behind Controllable Expressive Speech Synthesis: a Cross-disciplinary Approach

Noé Tits, Kevin El Haddad, Thierry Dutoit

Expressive speech synthesis is part of the Human-Computer Interaction field. It requires knowledge in areas such as machine learning, signal processing, sociology, and psychology. We present a history of the main methods of Text-to-Speech synthesis.

Abstract

As part of the Human-Computer Interaction field, expressive speech synthesis is a very rich domain, as it requires knowledge in areas such as machine learning, signal processing, sociology, and psychology. In this chapter, we focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader will get an overview of the main paradigms used in this field, through some of the most prominent systems and methods. We explain how speech can be represented and encoded with audio features. We present a history of the main methods of Text-to-Speech synthesis: concatenative, parametric, and statistical parametric speech synthesis. Finally, we focus on the last of these, with the latest techniques modeling Text-to-Speech synthesis as a sequence-to-sequence problem. This enables the use of deep learning blocks such as convolutional and recurrent neural networks, as well as attention mechanisms. The last part of the chapter assembles the different aspects of the theory and summarizes the concepts.

Wed Mar 27 2019
Artificial Intelligence
Visualization and Interpretation of Latent Spaces for Controlling Expressive Speech Synthesis through Audio Analysis
The field of Text-to-Speech has seen major improvements in recent years, benefiting from deep learning techniques. Systems able to control style have been developed and show impressive results.
Tue May 12 2020
Machine Learning
Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
Flowtron borrows insights from IAF and revamps Tacotron in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data.
Tue Jun 29 2021
NLP
A Survey on Neural Speech Synthesis
Text to speech is a hot research topic in the speech, language, and machine learning communities. The development of deep learning and artificial intelligence has significantly improved the quality of synthesized speech in recent years.
Mon Mar 11 2019
Machine Learning
Deep Text-to-Speech System with Seq2Seq Model
Recent trends in neural network based text-to-speech/speech synthesis have employed recurrent Seq2seq architectures. We show that our proposed model can achieve attention alignment much faster than previous architectures. Good audio quality can be achieved with a model that's much smaller in size.
Mon Sep 14 2020
Machine Learning
Controllable neural text-to-speech synthesis using intuitive prosodic features
Neural text-to-speech synthesis can generate speech that is indistinguishable from natural speech. However, the prosody of generated utterances often represents the average prosodic style of the database. The generated prosody is solely defined by the input text, which does not allow for different styles.
Mon Jan 11 2016
Neural Networks
Investigating gated recurrent neural networks for speech synthesis
The long short-term memory (LSTM) architecture is attractive because it addresses the vanishing gradient problem in standard RNNs. LSTMs can achieve significantly better performance on SPSS than deep feed-forward neural networks.