TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality
arXiv (2024)
Abstract
Automatic Speech Understanding (ASU) aims at human-like speech
interpretation, providing nuanced intent, emotion, sentiment, and content
understanding from speech and language (text) content conveyed in speech.
Typically, training a robust ASU model relies heavily on acquiring large-scale,
high-quality speech and associated transcriptions. However, it is often
challenging to collect or use speech data for training ASU due to concerns such
as privacy. To approach this setting of enabling ASU when speech (audio)
modality is missing, we propose TI-ASU, using a pre-trained text-to-speech
model to impute the missing speech. We report extensive experiments evaluating
TI-ASU on various missing scales, both multi- and single-modality settings, and
the use of LLMs. Our findings show that TI-ASU yields substantial benefits to
improve ASU in scenarios where even up to 95% of training speech is missing.
Moreover, we show that TI-ASU is adaptive to dropout training, improving model
robustness in addressing missing speech during inference.
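
The abstract does not spell out the procedure, so the following is only a minimal sketch of the two ideas it names: imputing missing speech from text, and dropout training over the speech modality. All names here (`impute_missing_speech`, `modality_dropout`, `fake_tts`, the `Sample` record) are hypothetical and not taken from the paper; `fake_tts` is a dummy stand-in for a pre-trained text-to-speech model, and the drop probability is an arbitrary illustrative value.

```python
import random
from typing import Callable, List, Optional, TypedDict

Audio = List[float]  # placeholder waveform type (assumption for illustration)


class Sample(TypedDict):
    text: str
    speech: Optional[Audio]  # None when the speech modality is missing


def impute_missing_speech(
    samples: List[Sample], synthesize: Callable[[str], Audio]
) -> List[Sample]:
    """Fill in speech for text-only samples using a TTS callable."""
    imputed: List[Sample] = []
    for s in samples:
        speech = s["speech"] if s["speech"] is not None else synthesize(s["text"])
        imputed.append({"text": s["text"], "speech": speech})
    return imputed


def modality_dropout(sample: Sample, p_drop: float, rng: random.Random) -> Sample:
    """With probability p_drop, hide the speech modality for one training step."""
    if rng.random() < p_drop:
        return {"text": sample["text"], "speech": None}
    return sample


def fake_tts(text: str) -> Audio:
    # Hypothetical stand-in for a pre-trained TTS model:
    # returns one dummy zero-valued frame per input character.
    return [0.0] * len(text)


train: List[Sample] = [
    {"text": "turn on the lights", "speech": [0.1, -0.2, 0.05]},
    {"text": "i am so happy today", "speech": None},  # speech missing
]

# Step 1: impute the missing speech from the transcript.
imputed = impute_missing_speech(train, fake_tts)

# Step 2: randomly drop speech during training so the model also
# learns to cope with text-only inputs at inference time.
rng = random.Random(0)
batch = [modality_dropout(s, p_drop=0.5, rng=rng) for s in imputed]
```

In this sketch, real recordings are left untouched and only text-only samples receive synthetic audio; the dropout step is applied afterwards, per sample, so both real and imputed speech can be hidden during training.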