Facial Animation Generation Integrating Audio Content, Style, and Emotional Features

Introduction

The field of audio-driven facial animation has seen significant advancements in recent years, with applications ranging from film production and virtual avatars to video conferencing and bandwidth reduction. Traditional methods have primarily focused on achieving lip synchronization with audio, often neglecting other crucial aspects such as facial expressions and head movements. A truly realistic talking head video should satisfy three key requirements: maintaining the identity of the target person, ensuring accurate lip movements synchronized with speech, and incorporating natural facial expressions and head motions.

While existing techniques like Wav2Lip have successfully addressed lip synchronization, they fall short in generating expressive facial animations that include emotional nuances and dynamic head movements. Some recent approaches, such as MakeItTalk, have introduced methods to produce subtle head motions, but these often suffer from video blurring and lack emotional depth. Similarly, methods like MSAAN focus on emotional lip-syncing but are limited to specific target identities and require input images with matching emotions.

To overcome these limitations, this paper introduces a novel approach called ACSEF (Audio Content, Style, and Emotional Features), which integrates audio-driven content, speaker style, and emotional characteristics to generate high-quality facial animations. The proposed method consists of two key modules: the Emotion Animation Module (EAM) for extracting and mapping emotional features to facial landmarks, and the Attention-Augmented Decoder based on U-Net (AADU) for synthesizing realistic video frames with enhanced texture and detail.

Background and Related Work

Audio-driven facial animation techniques can be broadly categorized into two approaches: end-to-end mapping and landmark-based decoding. End-to-end methods, such as Wav2Lip and X2Face, directly generate video frames from audio inputs using convolutional neural networks. While these methods achieve good lip synchronization, they often struggle with preserving facial identity and generating natural head movements. Additionally, many end-to-end approaches require multiple video frames as input, making them less suitable for applications with bandwidth constraints.

Landmark-based methods, on the other hand, first predict facial landmarks from audio and then decode these landmarks into video frames. Techniques like MakeItTalk and MSAAN have demonstrated success in producing arbitrary talking head animations with lip-syncing and limited head motions. However, these methods either lack emotional expressiveness or suffer from video quality degradation.

Recent advancements have attempted to address these shortcomings. For instance, EVP (Emotional Video Portraits) introduced cross-reconstruction emotion disentanglement to extract emotion-related features from audio. However, it still faces challenges in maintaining high-fidelity textures and handling arbitrary target identities. Similarly, Audio2Head focuses on generating natural head motions but does not incorporate emotional expressions.

Methodology

The ACSEF framework is designed to generate high-quality talking head videos that incorporate lip synchronization, head movements, and emotional expressions. The system operates in two main stages: landmark prediction and video frame synthesis.

Emotion Animation Module (EAM)

The EAM is responsible for extracting emotional features from audio and mapping them to facial landmark displacements. Unlike traditional methods that treat emotion as a secondary feature, EAM explicitly disentangles emotional characteristics from speech content and speaker style. This is achieved using a cross-reconstruction emotion disentanglement technique, which trains an emotion encoder to produce emotion embeddings independent of speech content.
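The paper does not give the encoder architectures, but the cross-reconstruction idea itself can be sketched with toy linear encoders and a linear decoder: given clips of the same sentences spoken with different emotions, swapping the emotion embeddings between two clips should reconstruct the clips with the swapped attributes. All dimensions and weight shapes below are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "encoders" and "decoder" standing in for the learned networks.
D_AUDIO, D_CON, D_EMO = 16, 8, 4
W_content = rng.standard_normal((D_AUDIO, D_CON)) * 0.1
W_emotion = rng.standard_normal((D_AUDIO, D_EMO)) * 0.1
W_decode = rng.standard_normal((D_CON + D_EMO, D_AUDIO)) * 0.1

def encode(x):
    """Split an audio feature into (content, emotion) embeddings."""
    return x @ W_content, x @ W_emotion

def decode(c, e):
    """Reconstruct an audio feature from content and emotion embeddings."""
    return np.concatenate([c, e]) @ W_decode

# Four toy clips indexed by (content, emotion); in training these would be
# the same sentences spoken with different emotions.
clips = {(c, e): rng.standard_normal(D_AUDIO) for c in (1, 2) for e in (1, 2)}

c11, e11 = encode(clips[(1, 1)])
c22, e22 = encode(clips[(2, 2)])

# Cross-reconstruction: swapping emotion embeddings between clips should
# reconstruct the clips with the swapped attributes, so the emotion
# embedding cannot encode speech content.
cross_loss = (np.mean((decode(c11, e22) - clips[(1, 2)]) ** 2) +
              np.mean((decode(c22, e11) - clips[(2, 1)]) ** 2))
```

Minimizing this loss (together with ordinary self-reconstruction) is what pushes the emotion embedding to carry only emotion-related information.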

The emotion embeddings are then processed by a long short-term memory (LSTM) network to capture temporal dependencies between audio features and facial movements. The LSTM outputs are fed into a multi-layer perceptron (MLP) that predicts emotion-driven landmark displacements. These displacements are summed with those from the content and style animation modules to produce the final predicted landmarks.
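The final combination step can be sketched as follows. The hidden size, MLP widths, and the 68-point landmark layout are assumptions for illustration; the content and style displacements are random stand-ins for the other two branches.

```python
import numpy as np

rng = np.random.default_rng(1)
N_LANDMARKS = 68  # assuming a standard 68-point facial landmark layout

def mlp(h, w1, b1, w2, b2):
    """Two-layer MLP mapping an LSTM hidden state to landmark displacements."""
    return (np.maximum(h @ w1 + b1, 0.0) @ w2 + b2).reshape(N_LANDMARKS, 2)

H = 32                            # hypothetical LSTM hidden size
hidden = rng.standard_normal(H)   # stand-in for one LSTM time step's output
d_emotion = mlp(hidden,
                rng.standard_normal((H, 64)) * 0.1, np.zeros(64),
                rng.standard_normal((64, N_LANDMARKS * 2)) * 0.1,
                np.zeros(N_LANDMARKS * 2))

# Displacements from the content and style branches (random stand-ins here).
d_content = rng.standard_normal((N_LANDMARKS, 2)) * 0.01
d_style = rng.standard_normal((N_LANDMARKS, 2)) * 0.01

# Neutral reference landmarks of the target identity, normalized to [0, 1].
reference = rng.uniform(0.2, 0.8, size=(N_LANDMARKS, 2))
predicted = reference + d_content + d_style + d_emotion
```

Treating each branch as an additive displacement over a neutral reference pose is what lets emotion be modeled separately from lip content and speaker style.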

A key innovation in EAM is the use of a loss function that minimizes both the distance between predicted and ground truth landmarks and their relative spatial relationships. This ensures that facial shapes and expressions are preserved while allowing for natural variations in emotional intensity.
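A minimal sketch of such a two-term loss, assuming an L2 penalty on absolute positions plus an L2 penalty on all pairwise landmark offsets (the exact weighting in the paper is not specified, so `lam` here is a hypothetical hyperparameter):

```python
import numpy as np

def landmark_loss(pred, gt, lam=1.0):
    """Absolute L2 term plus a relative term on pairwise landmark offsets.

    pred, gt: (N, 2) arrays of 2-D landmark coordinates.
    The relative term compares the offset between every pair of landmarks,
    so the predicted face keeps the ground-truth shape even when the whole
    landmark set is slightly translated.
    """
    absolute = np.mean(np.sum((pred - gt) ** 2, axis=-1))
    rel_pred = pred[:, None, :] - pred[None, :, :]   # (N, N, 2) offsets
    rel_gt = gt[:, None, :] - gt[None, :, :]
    relative = np.mean(np.sum((rel_pred - rel_gt) ** 2, axis=-1))
    return absolute + lam * relative

# A pure translation is penalized only by the absolute term: the
# pairwise offsets, and hence the relative term, are unchanged.
gt = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
shifted = gt + 0.1
```

The translation example shows why the relative term preserves facial shape: it is invariant to where the face sits in the frame and reacts only to deformations of the landmark configuration itself.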

Attention-Augmented Decoder based on U-Net (AADU)

The second stage of ACSEF involves synthesizing realistic video frames from the predicted landmarks. Traditional decoders often struggle with preserving fine details such as skin texture and facial shadows, leading to blurry or unrealistic outputs. To address this, the proposed AADU integrates a U-Net architecture with a Convolutional Block Attention Module (CBAM).

The U-Net structure enables multi-scale feature extraction, while the CBAM enhances the decoder’s ability to focus on critical regions such as the lips, eyes, and facial contours. The spatial attention component directs the network to prioritize areas that contribute most to emotional expression, while the channel attention mechanism reweights feature channels so that the most informative feature maps are emphasized.
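The two attention stages can be sketched in NumPy. Channel attention follows CBAM's design (average and max pooling through a shared MLP); for spatial attention this sketch substitutes a 1×1 combination of the mean and max maps for CBAM's 7×7 convolution, a deliberate simplification to keep the example short. All shapes and weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbam(x, w1, w2, w_sp):
    """Simplified CBAM forward pass on a (C, H, W) feature map.

    w1: (C, C//r), w2: (C//r, C) -- the shared MLP of channel attention.
    w_sp: (2,) -- a 1x1 mix of the [mean, max] spatial maps, standing in
    for CBAM's 7x7 convolution (a simplification).
    """
    # Channel attention: squeeze spatial dims with avg and max pooling,
    # pass both descriptors through the shared MLP, sum, and gate.
    avg = x.mean(axis=(1, 2))
    mx = x.max(axis=(1, 2))
    mlp = lambda v: np.maximum(v @ w1, 0.0) @ w2
    ca = sigmoid(mlp(avg) + mlp(mx))                  # (C,)
    x = x * ca[:, None, None]
    # Spatial attention: squeeze channels with mean and max, then gate
    # every spatial location.
    sa = sigmoid(w_sp[0] * x.mean(axis=0) + w_sp[1] * x.max(axis=0))  # (H, W)
    return x * sa[None, :, :]

rng = np.random.default_rng(2)
C, H, W, r = 8, 4, 4, 2
feat = rng.standard_normal((C, H, W))
out = cbam(feat,
           rng.standard_normal((C, C // r)) * 0.1,
           rng.standard_normal((C // r, C)) * 0.1,
           np.array([0.5, 0.5]))
```

Because both gates are sigmoids in (0, 1), the module can only attenuate features, never amplify them; the network learns to suppress uninformative channels and background regions so the decoder's capacity concentrates on the face.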

The AADU is trained using a combination of pixel-wise reconstruction loss and perceptual loss. The former ensures that the generated frames closely match the ground truth, while the latter leverages pre-trained VGG19 features to maintain high-level semantic consistency.
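A sketch of this combined objective, assuming L1 for the pixel term. A fixed random projection stands in for the frozen VGG19 feature maps (the real system would extract features from the pre-trained network); the weight `w_perc` is a hypothetical hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed random projection standing in for frozen VGG19 feature maps;
# in the actual system these features come from a pre-trained network.
PROJ = rng.standard_normal((64, 32)) * 0.1

def features(img):
    """Toy 'perceptual' features: flatten an 8x8 image and project it."""
    return img.reshape(-1) @ PROJ

def synthesis_loss(pred, gt, w_perc=0.1):
    """Pixel-wise L1 reconstruction loss plus a perceptual feature loss."""
    pixel = np.mean(np.abs(pred - gt))
    perceptual = np.mean((features(pred) - features(gt)) ** 2)
    return pixel + w_perc * perceptual

gt_img = rng.uniform(size=(8, 8))
noisy = np.clip(gt_img + rng.normal(scale=0.05, size=(8, 8)), 0.0, 1.0)
```

The pixel term alone tends toward over-smoothed averages; comparing in a feature space penalizes the loss of texture and structure that per-pixel distances barely notice, which is why the two are combined.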

Experimental Results

The proposed method was evaluated on the MEAD dataset, which contains high-quality emotional talking head videos from 60 actors, covering eight basic emotions. Comparative experiments were conducted against baseline methods, including MakeItTalk, MSAAN, and Audio2Head.

Objective Evaluation

Quantitative metrics were used to assess performance, including:
• F-LMD and M-LMD: Measures of facial and mouth landmark accuracy, where lower values indicate better alignment with ground truth.

• SSIM and PSNR: Indicators of video frame quality, with higher values representing better structural similarity and peak signal-to-noise ratio.
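The landmark-distance and PSNR metrics are simple enough to state directly. A common convention is to compute F-LMD over all landmarks and M-LMD over the mouth subset; the mouth indices 48–67 below assume the standard 68-point layout, which the paper does not specify.

```python
import numpy as np

def lmd(pred, gt):
    """Mean Euclidean landmark distance over frames; lower is better.

    pred, gt: (T, N, 2) landmark trajectories.
    """
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def psnr(pred, gt, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((pred - gt) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

MOUTH = slice(48, 68)  # mouth indices in the 68-point layout (assumed)
rng = np.random.default_rng(4)
gt_lm = rng.uniform(size=(10, 68, 2))
pred_lm = gt_lm + rng.normal(scale=0.01, size=gt_lm.shape)
f_lmd = lmd(pred_lm, gt_lm)                          # all landmarks
m_lmd = lmd(pred_lm[:, MOUTH], gt_lm[:, MOUTH])      # mouth region only
```

SSIM is more involved (local luminance, contrast, and structure comparisons over sliding windows) and is usually taken from a library such as scikit-image rather than reimplemented.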

ACSEF demonstrated superior performance across all metrics. Compared to MSAAN, it achieved a 0.03 reduction in F-LMD and improvements of 0.02 in SSIM and 0.05 in PSNR. While MSAAN performed slightly better in lip synchronization (M-LMD), ACSEF excelled in overall facial expressiveness and video quality.

Ablation studies further confirmed the contributions of EAM and AADU. Removing EAM led to a decline in emotional expressiveness, while omitting AADU resulted in reduced image clarity. The full ACSEF model consistently outperformed its variants, highlighting the importance of both modules.

Subjective Evaluation

A user study involving 30 participants assessed the perceived quality of generated videos. Three criteria were evaluated:
• Lip Synchronization (LS): Accuracy of lip movements relative to audio.

• Vivid Expression (VE): Naturalness and intensity of emotional expressions.

• Video Perceptual Quality (VPQ): Overall visual realism and detail.

ACSEF scored highest in VE (4.03) and VPQ (3.12), outperforming all baselines. While MSAAN achieved a marginally better LS score (2.63 vs. 2.46), participants rated ACSEF as significantly more expressive and visually appealing.

Qualitative Analysis

Visual comparisons revealed that ACSEF generated the most realistic and emotionally expressive results, closely resembling ground truth videos. MakeItTalk produced blurry outputs with inconsistent lip movements, while MSAAN exhibited artifacts around the eyes. Audio2Head struggled with identity preservation and emotional rendering.

The effectiveness of EAM was further illustrated through landmark visualizations. Without EAM, predicted landmarks failed to capture emotional nuances, such as raised eyebrows for anger or downturned lips for sadness. The full ACSEF model accurately reproduced these details, demonstrating its ability to enhance emotional expressiveness.

Discussion and Future Work

The ACSEF method represents a significant step forward in audio-driven facial animation by integrating emotional features with content and style modeling. The EAM module successfully bridges the gap between audio emotion and facial expressions, while the AADU decoder ensures high-fidelity video synthesis.

One current limitation is the reliance on landmark detectors designed for real faces, which restricts applicability to cartoon or stylized characters. Future work could explore adaptive landmark detection techniques to broaden the method’s usability. Additionally, extending the framework to handle multi-speaker interactions or dynamic background scenes would further enhance its practical utility.

Conclusion

This paper presented ACSEF, a novel approach for generating high-quality talking head animations that incorporate lip synchronization, head movements, and emotional expressions. By introducing the Emotion Animation Module and Attention-Augmented Decoder, the method achieves superior performance in both objective metrics and subjective evaluations. Experimental results demonstrate significant improvements over existing techniques, making ACSEF a promising solution for applications in entertainment, virtual communication, and beyond.

doi.org/10.19734/j.issn.1001-3695.2024.04.0168
