Exploring emergent syllables in end-to-end automatic speech recognizers through model explainability technique

Neural Computing and Applications(2024)

引用 0|浏览3
暂无评分
摘要
Automatic speech recognition systems based on end-to-end models (E2E-ASRs) can achieve comparable performance to conventional ASR systems while reproducing all their essential parts automatically, from speech units to the language model. However, they hide the underlying perceptual processes modelled, if any, and they have lower adaptability to multiple application contexts, and, furthermore, they require powerful hardware and an extensive amount of training data. Model-explainability techniques can explore the internal dynamics of these ASR systems and possibly understand and explain the processes conducting to their decisions and outputs. Understanding these processes can help enhance ASR performance and reduce the required training data and hardware significantly. In this paper, we probe the internal dynamics of three E2E-ASRs pre-trained for English by building an acoustic-syllable boundary detector for Italian and Spanish based on the E2E-ASRs’ internal encoding layer outputs. We demonstrate that the shallower E2E-ASR layers spontaneously form a rhythmic component correlated with prominent syllables, central in human speech processing. This finding highlights a parallel between the analysed E2E-ASRs and human speech recognition. Our results contribute to the body of knowledge by providing a human-explainable insight into behaviours encoded in popular E2E-ASR systems.
更多
查看译文
关键词
Automatic speech recognition,Deep learning,End-to-end models,Long short-term memory model,Conformers,Transformers,Psycho-acoustics,Syllables
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要