This paper introduces an artifact-aware temporal alignment framework that compensates for variable delays between asynchronous multimodal speech streams and suppresses processing artifacts. The proposed architecture pairs a temporal alignment module with a dual-branch enhancement network, with the aim of improving speech quality and ASR performance.
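To make the alignment problem concrete, the sketch below estimates a single constant delay between two streams by cross-correlation and then shifts one stream to compensate. This is a classical baseline shown for illustration only, not the paper's alignment module, whose internals are not described here. The function names `estimate_delay` and `align` are hypothetical, and the code assumes both streams are 1-D NumPy arrays at the same sample rate. A variable delay, as targeted by the paper, would need at least a windowed version of this estimate.

```python
import numpy as np


def estimate_delay(reference, delayed, max_lag):
    """Return the lag in samples, within [-max_lag, max_lag], that maximizes
    the cross-correlation. A positive lag means `delayed` lags `reference`."""
    # Remove the mean so a DC offset does not dominate the correlation.
    reference = reference - reference.mean()
    delayed = delayed - delayed.mean()
    lags = np.arange(-max_lag, max_lag + 1)
    scores = []
    for lag in lags:
        if lag >= 0:
            a = reference[: len(reference) - lag]
            b = delayed[lag:]
        else:
            a = reference[-lag:]
            b = delayed[: len(delayed) + lag]
        n = min(len(a), len(b))
        scores.append(np.dot(a[:n], b[:n]))
    return int(lags[int(np.argmax(scores))])


def align(delayed, lag):
    """Shift `delayed` by `lag` samples to undo the delay, zero-padding the gap."""
    out = np.zeros_like(delayed)
    if lag > 0:
        out[: len(delayed) - lag] = delayed[lag:]
    elif lag < 0:
        out[-lag:] = delayed[: len(delayed) + lag]
    else:
        out[:] = delayed
    return out


# Synthetic check: delay a noise signal by a known number of samples.
rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)
true_delay = 37
delayed = np.concatenate([np.zeros(true_delay), ref[:-true_delay]])

lag = estimate_delay(ref, delayed, max_lag=100)
aligned = align(delayed, lag)
```

On real recordings the correlation peak is less sharp than on synthetic noise, which is one reason a learned alignment module may be preferred over this baseline.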
Key findings
Proposes a novel framework for artifact-aware temporal alignment.
Addresses variable delays and suppresses processing artifacts in multimodal speech enhancement.
Combines temporal alignment with a dual-branch enhancement network.
Expected to improve objective metrics and reduce ASR word error rate (WER).
Limitations & open questions
The practicality of deploying the proposed method in real-world systems has yet to be validated.
The effectiveness of the framework under diverse real-world conditions remains to be seen.