As part of the Human-Computer Interaction field, expressive speech synthesis
is a rich domain that draws on areas such as machine learning, signal
processing, sociology, and psychology. In this Chapter, we will
focus mostly on the technical side. From the recording of expressive speech to
its modeling, the reader will have an overview of the main paradigms used in
this field, through some of the most prominent systems and methods. We explain
how speech can be represented and encoded with audio features. We present a
history of the main methods of Text-to-Speech synthesis: concatenative,
parametric, and statistical parametric speech synthesis. Finally, we focus on
the last of these, with the most recent techniques modeling Text-to-Speech
synthesis as a sequence-to-sequence problem. This enables the use of Deep
Learning blocks such as Convolutional and Recurrent Neural Networks as well as
Attention Mechanisms.
The last part of the Chapter brings together the different aspects of the
theory and summarizes the concepts.