This paper proposes a framework that extends clean-speech acoustic-to-articulatory inversion (AAI) to unseen speakers by combining self-supervised speech representations with low-rank adaptation (LoRA) modules, enabling rapid speaker-specific customization from minimal adaptation data.
Key findings
Proposed a comprehensive framework for extending clean-speech AAI to unseen speakers.
Combined self-supervised speech representations with low-rank adaptation (LoRA) for speaker-specific customization (see the LoRA sketch after this list).
Introduced a multi-stage adaptation pipeline that refines speaker embeddings while preserving phonetic content (a staged training sketch also follows below).
Validated the method through extensive experiments on multiple datasets, showing state-of-the-art performance in speaker-independent AAI.
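To make the adaptation mechanism concrete, here is a minimal PyTorch sketch of low-rank adaptation applied to a frozen layer. The module, the rank and alpha values, and the toy dimensions (768 as a typical SSL feature width, 12 standing in for articulatory trajectory dimensions) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x. Only A and B are trained during
    speaker adaptation; the pretrained weight W stays frozen."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained projection
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # B starts at zero, so adaptation begins from the unmodified model
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Hypothetical usage: a stand-in for an SSL encoder's output projection
# followed by an articulatory regression head (both assumed here).
head = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 12))
head[0] = LoRALinear(head[0], rank=4)
trainable = [p for p in head.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
```

Because only the low-rank factors are trainable, the per-speaker parameter count stays tiny relative to the backbone, which is what makes customization with minimal data plausible.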
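And a hedged sketch of what a multi-stage adaptation loop could look like, assuming a model that consumes speech features plus a learnable speaker embedding (an nn.Parameter). The two-stage schedule, the output-consistency regularizer used here to preserve phonetic content, the 0.1 weight, and all names (adapt, batches) are assumptions for illustration, not the paper's exact recipe.

```python
import copy
import torch
import torch.nn.functional as F

def adapt(model, spk_emb, lora_params, batches):
    """Stage 1 refines only the speaker embedding; stage 2 also unfreezes
    the low-rank adapters. An output-consistency term against the
    unadapted model is one way to discourage drift in phonetic content."""
    frozen = copy.deepcopy(model).eval()  # reference model, never updated
    for train_lora in (False, True):      # stage 1, then stage 2
        params = [spk_emb] + (list(lora_params) if train_lora else [])
        opt = torch.optim.Adam(params, lr=1e-4 if train_lora else 1e-3)
        for speech, articulators in batches:  # batches: list of pairs
            pred = model(speech, spk_emb)
            with torch.no_grad():
                # neutral-speaker reference from the frozen model
                ref = frozen(speech, torch.zeros_like(spk_emb))
            # regression to articulatory targets + phonetic-content anchor
            loss = F.mse_loss(pred, articulators) + 0.1 * F.mse_loss(pred, ref)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```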
Limitations & open questions
Practical deployment of AAI systems will require further generalization to more diverse speaker populations.
Collecting articulatory data is expensive and invasive, which limits the size of available datasets.