Learning Mask Scalars for Improved Robust Automatic Speech Recognition

2022 IEEE Spoken Language Technology Workshop (SLT) (2023)

Abstract
Improving the robustness of streaming automatic speech recognition (ASR) systems using neural network based acoustic frontends is challenging because of causality constraints and the speech distortions introduced by the frontend. Time-frequency masking-based approaches are commonly used, but they need additional hyperparameters – mask scalars – to limit distortion. Mask scalars are typically hand-tuned and chosen conservatively. In this work, we present a technique to predict mask scalars using ASR loss in an end-to-end fashion, with minimal increase in model size and complexity. We evaluate the approach on two robust ASR tasks: multichannel enhancement in the presence of speech and non-speech noise, and acoustic echo cancellation (AEC). Results show that the presented algorithm consistently improves word error rate (WER) over strong baselines that use hand-tuned hyperparameters: up to 16% in noisy conditions, and up to 7% for AEC.
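To make the idea concrete, below is a minimal sketch of one common way a mask scalar limits enhancement distortion: the enhanced spectrogram is a blend of the masked and unmasked inputs, controlled by a scalar alpha. Here alpha is produced by a small learned head rather than hand-tuned, loosely mirroring the paper's end-to-end prediction; this is not the authors' implementation, and the names (LearnedMaskScalar, scalar_head) and the per-utterance blending formulation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnedMaskScalar(nn.Module):
    """Hypothetical sketch: predict a mask scalar and apply it to a T-F mask.

    A small head maps time-averaged noisy features to a scalar alpha in
    (0, 1). The output blends masked and unmasked spectrograms, so alpha
    bounds how much distortion the enhancement frontend can introduce.
    """

    def __init__(self, feat_dim: int):
        super().__init__()
        # Tiny prediction head: minimal increase in model size, as in the abstract.
        self.scalar_head = nn.Sequential(
            nn.Linear(feat_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid(),
        )

    def forward(self, noisy: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # noisy, mask: (batch, time, freq); mask values assumed in [0, 1].
        alpha = self.scalar_head(noisy.mean(dim=1))  # (batch, 1), per utterance
        alpha = alpha.unsqueeze(1)                   # broadcast over time/freq
        # alpha -> 1 trusts the mask fully; alpha -> 0 passes the input through.
        return alpha * (mask * noisy) + (1.0 - alpha) * noisy
```

Because the scalar head is differentiable, gradients from the downstream ASR loss can flow into it during joint training, replacing the conservatively hand-tuned scalar of the baseline setup.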
Keywords
speech recognition,time-frequency masking,speech enhancement,acoustic echo cancellation