This work proposes a novel way of synthesizing audio-driven portrait videos. We show that photo-realistic images can be rendered by a small, fully connected neural network from the positional encoding of the 3D face surface and additional audio features extracted from arbitrary English speech. The method is based on the intermediate geometry of 3D morphable models (3DMMs). However, it is restricted neither to any particular face model nor by that model’s expression or identity space. The pipeline predicts both RGB color and 3D vertex displacement with respect to the mesh’s neutral space for a given speech signal. Temporal filtering of the audio-feature vectors stabilizes the output and provides smooth lip-audio synchronization. The rendered face is seamlessly embedded into a background using an autoencoder network with 2D dilated convolutions. Furthermore, the method generalizes well to arbitrary speech from an unknown source actor, provided that the speech is in English. Finally, several state-of-the-art projects were selected for the evaluation. Our method outperforms all of them in terms of image quality while maintaining a low lip synchronization error.
A full description of the project is available in my master’s thesis (see link). The network accepts arbitrary audio input and produces a photorealistic face image. The intermediate output of the network (the FC deformation NN section) is a mesh in the expression and shape space of the actor, which is not restricted by any 3DMM. Finally, based on the encoded 3D position (the positional encoding from NeRF), the FC color NN produces the final image. The deformation and color networks are trained jointly.
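To make the positional-encoding step more concrete, here is a minimal sketch of a NeRF-style encoding of a single 3D surface point. It is an illustration, not the project's code: the function name `positional_encoding` and the parameter `num_freqs` are hypothetical, and the number of frequency bands used in this project is not stated here (NeRF itself uses 10 for positions).

```python
import math


def positional_encoding(point, num_freqs=10):
    """Encode a 3D point as NeRF-style sin/cos features.

    For every coordinate p and every band k in [0, num_freqs), the
    features sin(2^k * pi * p) and cos(2^k * pi * p) are emitted.
    num_freqs=10 is an assumed default taken from NeRF, not from this project.
    """
    encoded = []
    for coord in point:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * coord
            encoded.append(math.sin(angle))
            encoded.append(math.cos(angle))
    return encoded


# Example: one point on the face surface (coordinates are made up).
features = positional_encoding((0.5, -0.25, 0.0), num_freqs=4)
print(len(features))  # 3 coordinates * 4 bands * 2 functions = 24
```

In a pipeline like the one described above, a vector of this kind would presumably be concatenated with the audio features before being passed to the color network; how exactly the two are combined is not specified in this description.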
To seamlessly embed the output image into the background, another network is trained separately. It takes the background and the output of the previous network, both eroded accordingly, and predicts the final composed image.
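The erosion step can be pictured with a small sketch. It assumes, as one plausible reading of the description, that a binary face mask and its complement (the background mask) are both shrunk, which leaves a band around the face boundary for the composition network to fill in. The helper names `erode` and `composition_inputs`, the 3x3 structuring element, and the number of iterations are all hypothetical choices, not taken from the project.

```python
def erode(mask, iterations=1, outside=0):
    """Binary erosion with a 3x3 square structuring element.

    mask is a list of rows of 0/1 values. Pixels beyond the image border
    are treated as having the value `outside`.
    """
    h, w = len(mask), len(mask[0])

    def at(m, y, x):
        return m[y][x] if 0 <= y < h and 0 <= x < w else outside

    for _ in range(iterations):
        # A pixel survives only if its whole 3x3 neighborhood is set.
        mask = [
            [
                int(all(at(mask, y + dy, x + dx)
                        for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
                for x in range(w)
            ]
            for y in range(h)
        ]
    return mask


def composition_inputs(face_mask, iterations=1):
    """Return (eroded face mask, eroded background mask, gap band)."""
    background_mask = [[1 - v for v in row] for row in face_mask]
    face_in = erode(face_mask, iterations, outside=0)
    # outside=1 keeps the background intact at the image border.
    background_in = erode(background_mask, iterations, outside=1)
    gap = [
        [1 - f - b for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(face_in, background_in)
    ]
    return face_in, background_in, gap


# Example: a 7x7 image with a 3x3 "face" in the center.
face = [[1 if 2 <= y <= 4 and 2 <= x <= 4 else 0 for x in range(7)]
        for y in range(7)]
face_in, background_in, gap = composition_inputs(face)
print(sum(map(sum, gap)))  # number of pixels left for the network to predict
```

The eroded masks would then be multiplied with the rendered face and the background image, respectively, before both are passed to the composition network.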
Photorealistic face images are predicted together with 3D mesh displacements. Below you can see the output of each segment of the pipeline separately, from left to right: deformation network, color network, composition network.
Below you can see some of the generated videos. From left to right: the original target actor video, the rendered target actor (a video generated by neural rendering driven by audio), and the source actor, from whom only the voice is taken.
The source actor is from the series House of Cards.
The source actor is from the movie The Matrix.