Published on Mon Feb 24 2020

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Ran Yi, Zipeng Ye, Juyong Zhang, Hujun Bao, Yong-Jin Liu

Abstract

Real-world talking faces are often accompanied by natural head movement. However, most existing talking face video generation methods only consider facial animation with a fixed head pose. In this paper, we address this problem by proposing a deep neural network model that takes an audio signal A of a source person and a very short video V of a target person as input, and outputs a synthesized high-quality talking face video with personalized head pose (making use of the visual information in V), expression and lip synchronization (by considering both A and V). The most challenging issue in our work is that natural poses often cause in-plane and out-of-plane head rotations, which make the synthesized talking face video far from realistic. To address this challenge, we reconstruct 3D face animation and re-render it into synthesized frames. To fine-tune these frames into realistic ones with smooth background transitions, we propose a novel memory-augmented GAN module. By first training a general mapping on a publicly available dataset and then fine-tuning it on the input short video of the target person, we develop an effective strategy that requires only a small number of frames (about 300) to learn personalized talking behavior, including head pose. Extensive experiments and two user studies show that our method generates high-quality talking face videos (with personalized head movements, expressions and good lip synchronization) that look more natural and show more distinctive head movements than those of state-of-the-art methods.
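The abstract describes a pipeline: encode the driving audio, predict personalized 3D face parameters (expression plus head pose) from it, re-render those parameters into coarse frames, and refine the frames with a memory-augmented GAN. The sketch below is a minimal PyTorch illustration of that flow under stated assumptions: the module names (AudioEncoder, PoseExpressionNet, MemoryAugmentedRefiner), layer sizes, and the attention-based memory read are all hypothetical rather than the authors' implementation, and the differentiable 3D re-rendering step is omitted.

```python
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    """Encodes a window of audio features (assumed MFCCs) into a latent code."""

    def __init__(self, in_dim=28 * 12, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, mfcc):               # (B, T, in_dim)
        return self.net(mfcc)              # (B, T, latent_dim)


class PoseExpressionNet(nn.Module):
    """Maps audio codes to 3D face expression and head-pose parameters.
    Per the abstract, a general mapping would be trained on a public
    dataset, then fine-tuned on ~300 frames of the target person."""

    def __init__(self, latent_dim=128, exp_dim=64, pose_dim=6):
        super().__init__()
        self.rnn = nn.LSTM(latent_dim, 256, batch_first=True)
        self.exp_head = nn.Linear(256, exp_dim)
        self.pose_head = nn.Linear(256, pose_dim)  # rotation + translation

    def forward(self, audio_code):         # (B, T, latent_dim)
        h, _ = self.rnn(audio_code)
        return self.exp_head(h), self.pose_head(h)


class MemoryAugmentedRefiner(nn.Module):
    """Stand-in for the memory-augmented GAN generator: refines coarse
    re-rendered frames into realistic ones. The learned external memory
    read via spatial soft attention is an assumption, not the paper's
    exact design."""

    def __init__(self, mem_slots=64, feat_dim=256):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(mem_slots, feat_dim))
        self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)
        self.decoder = nn.Conv2d(feat_dim, 3, 3, padding=1)

    def forward(self, coarse):             # (B, 3, H, W)
        feat = self.encoder(coarse)        # (B, C, H, W)
        b, c, h, w = feat.shape
        q = feat.flatten(2).transpose(1, 2)              # (B, HW, C)
        attn = torch.softmax(q @ self.memory.T, dim=-1)  # (B, HW, slots)
        read = (attn @ self.memory).transpose(1, 2).reshape(b, c, h, w)
        return torch.tanh(self.decoder(feat + read))     # refined frame


if __name__ == "__main__":
    mfcc = torch.randn(1, 25, 28 * 12)     # 25 frames of audio features
    code = AudioEncoder()(mfcc)
    exp, pose = PoseExpressionNet()(code)
    # A differentiable renderer (omitted here) would turn exp/pose into
    # coarse frames; a random tensor stands in to exercise the refiner.
    refined = MemoryAugmentedRefiner()(torch.randn(1, 3, 64, 64))
    print(exp.shape, pose.shape, refined.shape)
```

Under this reading, the abstract's two-stage training strategy would correspond to fitting PoseExpressionNet on a public dataset first and then fine-tuning it on the roughly 300 frames of the target video to pick up personalized head-pose behavior.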

Thu Apr 22 2021
Machine Learning
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
Fri Apr 16 2021
Computer Vision
Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation
In this paper, we propose a novel text-based talking-head video generation framework. It synthesizes high-fidelity facial expressions and head motions in accordance with contextual sentiments as well as speech rhythm and pauses. Specifically, our framework consists of a speaker-independent stage and a speaker-specific stage.
Sun Apr 25 2021
Computer Vision
3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head
Mon Dec 17 2018
Computer Vision
Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning
Talking face generation aims to synthesize a face video with precise lip synchronization. Most existing methods mainly focus on disentangling the information in a single image. We propose a novel arbitrary talking face generation framework by discovering the audio-visual coherence.
Thu May 09 2019
Computer Vision
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer the audio to a high-level structure, i.e., facial landmarks, and then to generate video frames conditioned on those landmarks.
Mon Aug 09 2021
Computer Vision
AnyoneNet: Synchronized Speech and Talking Head Generation for arbitrary person
Automatically generating videos in which synthesized speech is synchronized with lip movements in a talking head has great potential in many human-computer interactions. The proposed method decomposes the generation of synchronized speech and talking head videos into two stages, i.e., a text-to-speech stage and a speech-driven stage.