This paper presents the first theoretical characterization of loss-quality divergence in neural TTS fine-tuning. It formalizes the divergence by decomposing the training objective into three components: acoustic reconstruction error, linguistic alignment error, and prosodic fidelity error. The study proves that gradient descent on the composite loss can produce parameter updates that reduce acoustic reconstruction error at the expense of perceptual quality, and identifies the neural tangent kernel (NTK) overlap as the critical factor governing divergence.
Key findings
Loss-quality divergence in neural TTS fine-tuning is characterized theoretically.
Training objective is decomposed into acoustic reconstruction error, linguistic alignment error, and prosodic fidelity error.
Gradient descent can lead to updates minimizing acoustic reconstruction while degrading perceptual quality.
The neural tangent kernel (NTK) overlap between the acoustic and perceptual task manifolds is identified as the critical factor governing divergence.
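The divergence mechanism above can be illustrated with a minimal sketch. This is not the paper's construction: it uses a toy linear model, hypothetical acoustic and perceptual targets, and the cosine similarity of the two objectives' parameter gradients as a scalar proxy for the NTK overlap. When that overlap is negative, a step that lowers acoustic reconstruction error raises the perceptual objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "model": y = W x. The acoustic and perceptual targets are
# hypothetical stand-ins for the paper's objective components.
W = rng.normal(size=(4, 4))
x = rng.normal(size=4)
y_acoustic = rng.normal(size=4)    # reconstruction target
y_perceptual = rng.normal(size=4)  # proxy for perceptual quality

def grad_mse(W, x, y):
    """Gradient of 0.5 * ||W x - y||^2 with respect to W."""
    return np.outer(W @ x - y, x)

g_ac = grad_mse(W, x, y_acoustic)
g_pc = grad_mse(W, x, y_perceptual)

# Scalar proxy for the NTK overlap between the two objectives:
# cosine similarity of their parameter gradients.
overlap = np.sum(g_ac * g_pc) / (np.linalg.norm(g_ac) * np.linalg.norm(g_pc))

# To first order, a step of size lr down the acoustic gradient changes the
# perceptual loss by -lr * <g_ac, g_pc>; a negative overlap means the
# acoustic update degrades the perceptual objective.
lr = 0.1
W_new = W - lr * g_ac
loss_pc_before = 0.5 * np.sum((W @ x - y_perceptual) ** 2)
loss_pc_after = 0.5 * np.sum((W_new @ x - y_perceptual) ** 2)
print(f"overlap={overlap:.3f}, "
      f"perceptual loss {loss_pc_before:.3f} -> {loss_pc_after:.3f}")
```

Because the model is linear in its parameters, the change in perceptual loss is exactly the first-order term plus a quadratic correction, which makes the role of the gradient inner product easy to verify numerically.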
Limitations & open questions
The study focuses on neural TTS systems and may not generalize to other domains.
Further research is needed to test the practical implications of the theoretical findings across diverse TTS architectures and fine-tuning scenarios.