Learning Mask Scalars for Improved Robust Automatic Speech Recognition

2022 IEEE Spoken Language Technology Workshop (SLT) (2023)

Abstract
Improving the robustness of streaming automatic speech recognition (ASR) systems using neural network based acoustic frontends is challenging because of causality constraints and the speech distortions introduced by the frontend. Time-frequency masking-based approaches are commonly used, but they need additional hyperparameters – mask scalars – to limit distortion. Mask scalars are typically hand-tuned and chosen conservatively. In this work, we present a technique to predict mask scalars using ASR loss in an end-to-end fashion, with minimal increase in model size and complexity. We evaluate the approach on two robust ASR tasks: multichannel enhancement in the presence of speech and non-speech noise, and acoustic echo cancellation (AEC). Results show that the presented algorithm consistently improves word error rate (WER) over strong baselines that use hand-tuned hyperparameters: up to 16% in noisy conditions, and up to 7% for AEC.
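To make the idea concrete, below is a minimal sketch of one common way a mask scalar limits enhancement distortion: the enhanced spectrogram is a blend of the masked and unmasked inputs, controlled by a scalar alpha. Here alpha is produced by a small learned head rather than hand-tuned, loosely mirroring the paper's end-to-end prediction; this is not the authors' implementation, and the names (LearnedMaskScalar, scalar_head) and the per-utterance blending formulation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnedMaskScalar(nn.Module):
    """Hypothetical sketch: predict a mask scalar and apply it to a T-F mask.

    A small head maps time-averaged noisy features to a scalar alpha in
    (0, 1). The output blends masked and unmasked spectrograms, so alpha
    bounds how much distortion the enhancement frontend can introduce.
    """

    def __init__(self, feat_dim: int):
        super().__init__()
        # Tiny prediction head: minimal increase in model size, as in the abstract.
        self.scalar_head = nn.Sequential(
            nn.Linear(feat_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid(),
        )

    def forward(self, noisy: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # noisy, mask: (batch, time, freq); mask values assumed in [0, 1].
        alpha = self.scalar_head(noisy.mean(dim=1))  # (batch, 1), per utterance
        alpha = alpha.unsqueeze(1)                   # broadcast over time/freq
        # alpha -> 1 trusts the mask fully; alpha -> 0 passes the input through.
        return alpha * (mask * noisy) + (1.0 - alpha) * noisy
```

Because the scalar head is differentiable, gradients from the downstream ASR loss can flow into it during joint training, replacing the conservatively hand-tuned scalar of the baseline setup.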
Keywords
speech recognition,time-frequency masking,speech enhancement,acoustic echo cancellation