This paper introduces FM-Face, a novel architecture that uses Frequency-Modulated LSTM cells to generate realistic, temporally coherent 3D facial mesh sequences synchronized with input audio. The architecture captures the distinct frequency characteristics of facial motion: high-frequency for phonemes, mid-frequency for prosody, and low-frequency for emotion. Experiments on the VOCAset, BIWI, and MMFace4D datasets show a 15.2% reduction in vertex error relative to FaceFormer while running at 45 FPS.
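The paper's exact cell design is not reproduced here; the sketch below shows one plausible reading of the idea, in which the audio feature stream is split into low/mid/high temporal bands with moving-average filters and each band is processed by its own LSTM before fusion into per-frame vertex offsets. The class names (`FrequencyBandSplit`, `FMFaceSketch`), the window sizes, and the dimensions (80-dim audio features, 5023 vertices as in FLAME/VOCAset-style templates) are all illustrative assumptions, not the authors' specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyBandSplit(nn.Module):
    """Split a feature sequence (B, T, C) into low/mid/high temporal bands
    using moving-average filters of two window sizes. This is an assumed
    mechanism; the paper's actual frequency modulation may differ."""
    def __init__(self, low_win=15, mid_win=5):
        super().__init__()
        self.low_win = low_win
        self.mid_win = mid_win

    def _smooth(self, x, win):
        # Moving average along T with 'same' output length.
        xt = x.transpose(1, 2)                        # (B, C, T)
        pad = win // 2
        xt = F.pad(xt, (pad, win - 1 - pad), mode="replicate")
        xt = F.avg_pool1d(xt, kernel_size=win, stride=1)
        return xt.transpose(1, 2)                     # (B, T, C)

    def forward(self, x):
        low = self._smooth(x, self.low_win)           # slow trend  (emotion)
        smooth_mid = self._smooth(x, self.mid_win)
        mid = smooth_mid - low                        # mid band    (prosody)
        high = x - smooth_mid                         # fast detail (phonemes)
        return low, mid, high

class FMFaceSketch(nn.Module):
    """Hypothetical FM-Face-style decoder: one LSTM per frequency band,
    fused and mapped to per-frame 3D vertex offsets."""
    def __init__(self, audio_dim=80, hidden=256, n_vertices=5023):
        super().__init__()
        self.split = FrequencyBandSplit()
        self.lstm_low = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.lstm_mid = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.lstm_high = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(3 * hidden, n_vertices * 3)

    def forward(self, audio_feats):
        # audio_feats: (B, T, audio_dim), e.g. log-mel frames.
        low, mid, high = self.split(audio_feats)
        h_low, _ = self.lstm_low(low)
        h_mid, _ = self.lstm_mid(mid)
        h_high, _ = self.lstm_high(high)
        fused = torch.cat([h_low, h_mid, h_high], dim=-1)
        offsets = self.head(fused)                    # (B, T, n_vertices*3)
        return offsets.view(*offsets.shape[:2], -1, 3)

# Usage: 60 frames of 80-dim audio features -> (2, 60, 5023, 3) vertex offsets.
model = FMFaceSketch()
verts = model(torch.randn(2, 60, 80))
```

Running the three bands through separate recurrent paths is one natural way to let each operate at its own timescale; whatever the paper's concrete mechanism, the fusion step is where phoneme-, prosody-, and emotion-level motion would be recombined.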
Key findings
FM-Face introduces frequency-modulated recurrent processing for speech-driven facial animation.
Achieves state-of-the-art accuracy with real-time performance: 15.2% lower vertex error than FaceFormer on VOCAset at 45 FPS inference.
Reduces velocity error by 16.9%, indicating smoother, less jittery motion over time (see the metric sketch after this list).
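The paper's exact metric definitions are not quoted above. A common formulation, which these numbers plausibly refer to, is the mean per-vertex L2 distance for vertex error and the same distance applied to frame-to-frame displacements for velocity error; the sketch below assumes that formulation.

```python
import torch

def vertex_error(pred, gt):
    """Mean per-vertex L2 distance between predicted and ground-truth
    meshes. pred, gt: (T, V, 3). A common definition; the paper's exact
    variant (e.g. lip-region-only error) may differ."""
    return (pred - gt).norm(dim=-1).mean()

def velocity_error(pred, gt):
    """The same L2 error applied to frame-to-frame vertex displacements,
    penalizing temporal jitter rather than static offset."""
    return (pred.diff(dim=0) - gt.diff(dim=0)).norm(dim=-1).mean()

# Usage with random stand-in data: 60 frames, 5023 vertices.
pred, gt = torch.randn(60, 5023, 3), torch.randn(60, 5023, 3)
print(vertex_error(pred, gt).item(), velocity_error(pred, gt).item())
```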
Limitations & open questions
Generalization to datasets and capture conditions beyond VOCAset, BIWI, and MMFace4D remains an open question for further research.