This paper introduces FM-Face, a novel architecture that uses Frequency-Modulated LSTM cells to generate realistic, temporally coherent 3D facial mesh sequences synchronized with input audio. The architecture captures the distinct frequency characteristics of facial motion: high-frequency for phonemes, mid-frequency for prosody, and low-frequency for emotion. Experiments on the VOCAset, BIWI, and MMFace4D datasets show a 15.2% reduction in vertex error relative to FaceFormer while running at 45 FPS.
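The paper's exact cell design is not reproduced here; the sketch below shows one plausible reading of the idea, in which the audio feature stream is split into low/mid/high temporal bands with moving-average filters and each band is processed by its own LSTM before fusion into per-frame vertex offsets. The class names (`FrequencyBandSplit`, `FMFaceSketch`), the window sizes, and the dimensions (80-dim audio features, 5023 vertices as in FLAME/VOCAset-style templates) are all illustrative assumptions, not the authors' specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyBandSplit(nn.Module):
    """Split a feature sequence (B, T, C) into low/mid/high temporal bands
    using moving-average filters of two window sizes. This is an assumed
    mechanism; the paper's actual frequency modulation may differ."""
    def __init__(self, low_win=15, mid_win=5):
        super().__init__()
        self.low_win = low_win
        self.mid_win = mid_win

    def _smooth(self, x, win):
        # Moving average along T with 'same' output length.
        xt = x.transpose(1, 2)                        # (B, C, T)
        pad = win // 2
        xt = F.pad(xt, (pad, win - 1 - pad), mode="replicate")
        xt = F.avg_pool1d(xt, kernel_size=win, stride=1)
        return xt.transpose(1, 2)                     # (B, T, C)

    def forward(self, x):
        low = self._smooth(x, self.low_win)           # slow trend  (emotion)
        smooth_mid = self._smooth(x, self.mid_win)
        mid = smooth_mid - low                        # mid band    (prosody)
        high = x - smooth_mid                         # fast detail (phonemes)
        return low, mid, high

class FMFaceSketch(nn.Module):
    """Hypothetical FM-Face-style decoder: one LSTM per frequency band,
    fused and mapped to per-frame 3D vertex offsets."""
    def __init__(self, audio_dim=80, hidden=256, n_vertices=5023):
        super().__init__()
        self.split = FrequencyBandSplit()
        self.lstm_low = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.lstm_mid = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.lstm_high = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(3 * hidden, n_vertices * 3)

    def forward(self, audio_feats):
        # audio_feats: (B, T, audio_dim), e.g. log-mel frames.
        low, mid, high = self.split(audio_feats)
        h_low, _ = self.lstm_low(low)
        h_mid, _ = self.lstm_mid(mid)
        h_high, _ = self.lstm_high(high)
        fused = torch.cat([h_low, h_mid, h_high], dim=-1)
        offsets = self.head(fused)                    # (B, T, n_vertices*3)
        return offsets.view(*offsets.shape[:2], -1, 3)

# Usage: 60 frames of 80-dim audio features -> (2, 60, 5023, 3) vertex offsets.
model = FMFaceSketch()
verts = model(torch.randn(2, 60, 80))
```

Running the three bands through separate recurrent paths is one natural way to let each operate at its own timescale; whatever the paper's concrete mechanism, the fusion step is where phoneme-, prosody-, and emotion-level motion would be recombined.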
Key findings
FM-Face introduces frequency-modulated recurrent processing for speech-driven facial animation.
Achieves state-of-the-art accuracy with real-time performance: 15.2% lower vertex error than FaceFormer on VOCAset at 45 FPS inference.
Reduces velocity error by 16.9%, indicating smoother, less jittery motion over time (see the metric sketch after this list).
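The paper's exact metric definitions are not quoted above. A common formulation, which these numbers plausibly refer to, is the mean per-vertex L2 distance for vertex error and the same distance applied to frame-to-frame displacements for velocity error; the sketch below assumes that formulation.

```python
import torch

def vertex_error(pred, gt):
    """Mean per-vertex L2 distance between predicted and ground-truth
    meshes. pred, gt: (T, V, 3). A common definition; the paper's exact
    variant (e.g. lip-region-only error) may differ."""
    return (pred - gt).norm(dim=-1).mean()

def velocity_error(pred, gt):
    """The same L2 error applied to frame-to-frame vertex displacements,
    penalizing temporal jitter rather than static offset."""
    return (pred.diff(dim=0) - gt.diff(dim=0)).norm(dim=-1).mean()

# Usage with random stand-in data: 60 frames, 5023 vertices.
pred, gt = torch.randn(60, 5023, 3), torch.randn(60, 5023, 3)
print(vertex_error(pred, gt).item(), velocity_error(pred, gt).item())
```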
Limitations & open questions
Generalization to datasets and capture conditions beyond VOCAset, BIWI, and MMFace4D remains an open question for further research.