This paper introduces Dynamic Timbre Tracking (DTT), a framework for modeling the temporal evolution of acoustic parameters in continuous speech. DTT extends traditional static acoustic parameters into continuous trajectories via a unified state-space formulation, comprising three components: feature extraction, continuous-state timbre modeling, and temporal dynamics characterization. Evaluated on benchmarks for speaker identification, emotion recognition, and voice quality assessment, the method achieves performance competitive with DNN embeddings.
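The paper's exact state-space formulation is not given in this summary. As an illustrative sketch only, continuous-state tracking of a single noisy acoustic parameter (e.g. a spectral centroid trajectory) can be framed as a 1-D random-walk Kalman filter; the process-noise `q` and observation-noise `r` values below are hypothetical, not taken from the paper.

```python
import numpy as np

def track_timbre(obs, q=1e-2, r=1e-1):
    """Sketch of continuous-state timbre tracking: a 1-D linear-Gaussian
    (Kalman) filter with a random-walk state model, smoothing one noisy
    frame-level acoustic parameter. q and r are illustrative values."""
    x, p = obs[0], 1.0            # initial state estimate and variance
    states = []
    for z in obs:
        p = p + q                 # predict: random-walk state transition
        k = p / (p + r)           # Kalman gain
        x = x + k * (z - x)       # correct with observation z
        p = (1 - k) * p
        states.append(x)
    return np.array(states)

# Toy "timbre trajectory": a slow sinusoid corrupted by frame-level noise
t = np.linspace(0, 1, 200)
rng = np.random.default_rng(0)
noisy = np.sin(2 * np.pi * t) + 0.3 * rng.standard_normal(200)
smooth = track_timbre(noisy)
```

In steady state this reduces to exponential smoothing with gain `k`, which is one reason such filters have negligible inference cost compared to DNN embeddings.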
Key findings
DTT extends static acoustic parameters into continuous trajectories for timbre analysis.
The framework includes multi-resolution feature extraction and phoneme-aware modeling.
DTT characterizes rate-of-change, acceleration, and interaction patterns among timbre dimensions.
The method offers explicit interpretability and negligible inference cost compared to DNN embeddings.
DTT achieves competitive performance on established benchmarks for speech analysis tasks.
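The rate-of-change and acceleration patterns mentioned above are commonly computed as regression-based delta features over a short window of frames (the HTK-style formula); applying the operator twice yields acceleration (delta-delta) coefficients. This is a generic sketch, not the paper's own implementation, and the window half-width `N=2` is an assumed default.

```python
import numpy as np

def deltas(feat, N=2):
    """Regression-based delta (rate-of-change) of a (frames, dims)
    trajectory over a window of +/-N frames; deltas(deltas(x)) gives
    acceleration (delta-delta) features."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")  # replicate edges
    T = feat.shape[0]
    out = np.zeros_like(feat, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / denom

traj = np.arange(100.0)[:, None]  # toy 1-D trajectory: a linear ramp
d = deltas(traj)                  # rate-of-change (slope 1 away from edges)
dd = deltas(d)                    # acceleration (0 for a linear ramp)
```

Cross-dimensional interaction patterns could then be summarized by, e.g., windowed covariances between the delta trajectories of different timbre dimensions.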
Limitations & open questions
The paper does not discuss the scalability of DTT for very large datasets.
The generalization of DTT to other languages and accents is not addressed.