Being Greedy Does Not Hurt: Sampling Strategies for End-To-End Speech Recognition

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

Abstract
Maximum Likelihood Estimation (MLE) is currently the most common approach to training large-scale speech recognition systems. While it has significant practical advantages, MLE exhibits several drawbacks known in the literature: training and inference conditions are mismatched, and a proxy objective is optimized instead of the word error rate. Recently, the Optimal Completion Distillation (OCD) training method was proposed, which attempts to address some of these issues. In this paper, we analyze whether the method is competitive against a strong MLE baseline and investigate its scalability to large speech data beyond read speech, which to our knowledge is the first such attempt in the literature. In addition, we propose and analyze several sampling strategies that trade off exploration and exploitation of unseen prefixes, and study their effect on ASR accuracy. We conduct experiments on both public LibriSpeech data and in-house large-scale far-field data and compare models trained with MLE and OCD. Our proposed greedy sampling with soft targets proves most effective and yields a 9% relative word error rate improvement over i.i.d. sampling. Finally, we note that OCD improves over MLE without label smoothing by 12%, but underperforms by 6% once label smoothing is introduced to MLE.
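The abstract does not give implementation details, so the following is only a rough Python sketch of the kind of training step it describes: the decoder consumes its own greedy (argmax) predictions rather than the ground-truth prefix, and each step is scored against OCD-style soft targets spread uniformly over the next tokens that keep the achievable edit distance to the reference minimal. The `decoder` module, its call signature, and the uniform target distribution are assumptions for illustration, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F


def edit_distance_row(prefix, reference):
    """Last row of the Levenshtein DP table: row[j] = dist(prefix, reference[:j])."""
    prev = list(range(len(reference) + 1))
    for i, p in enumerate(prefix, start=1):
        cur = [i]
        for j, r in enumerate(reference, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (p != r)))  # substitution
        prev = cur
    return prev


def ocd_soft_targets(prefix, reference, vocab_size, eos_id):
    """Uniform distribution over next tokens that keep the edit distance
    to the reference minimal (an OCD-style soft target; an assumption here)."""
    row = edit_distance_row(prefix, reference)
    best = min(row)
    optimal = {reference[j] for j, d in enumerate(row[:-1]) if d == best}
    if row[-1] == best:          # the full reference is already optimally matched
        optimal.add(eos_id)
    target = torch.zeros(vocab_size)
    target[list(optimal)] = 1.0 / len(optimal)
    return target


def greedy_ocd_step(decoder, encoder_states, reference, vocab_size,
                    bos_id, eos_id, max_len=200):
    """One training step with greedy sampling: feed back argmax predictions
    (exploitation of the model's own prefixes) and accumulate cross-entropy
    against the soft targets. `decoder` is a hypothetical autoregressive
    module returning (logits, new_state) per step."""
    token, state, prefix = bos_id, None, []
    loss = torch.zeros(())
    for _ in range(max_len):
        logits, state = decoder(token, encoder_states, state)  # assumed signature
        target = ocd_soft_targets(prefix, reference, vocab_size, eos_id)
        loss = loss - (target * F.log_softmax(logits, dim=-1)).sum()
        token = int(logits.argmax())   # greedy sampling: use the model's own output
        if token == eos_id:
            break
        prefix.append(token)
    return loss
```

An i.i.d. sampling variant (the paper's baseline strategy) would instead draw `token` from the softmax distribution, trading exploitation for exploration of less likely prefixes.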
Keywords
Speech recognition, non-maximum likelihood training, training criteria