Towards Accurate and Real-Time End-of-Speech Estimation

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

Abstract
We introduce a variant of the endpoint (EP) detection problem in automatic speech recognition (ASR), which we call end-of-speech (EOS) estimation. Given an utterance, EOS estimation aims to identify the timestamp at which the utterance waveform has fully decayed; this timestamp is then used to measure the EP latency. Accurate EOS estimation is difficult in large-scale streaming audio scenarios because of heavy traffic and hardware limitations. To this end, we develop an efficient and accurate framework that performs forced alignment on the 1-best ASR hypothesis. In particular, we propose to use binarized state sequences for alignment, which yields an EOS estimate that is robust to the ASR hypothesis and reduces the estimation error by 28% compared to aligning on phoneme states. In addition, we observe a further 30% error reduction when the intermediate-stage embeddings of the encoder are used as additional features to compute the binary probabilities.
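The alignment idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "binarized states" means a left-to-right sequence of non-speech (0) and speech (1) states derived from the hypothesis, that a per-frame speech probability is already available, and that the EOS is taken as the first frame after the last speech state. The function name `estimate_eos_frame` and its parameters are hypothetical.

```python
import math


def estimate_eos_frame(speech_probs, states, eps=1e-6):
    """Viterbi forced alignment of frames to a binary state sequence.

    speech_probs: per-frame P(speech) (assumed to come from some classifier).
    states: left-to-right binarized states, 0 = non-speech, 1 = speech.
    Returns the index of the first frame after the last speech frame.
    """
    num_frames, num_states = len(speech_probs), len(states)
    if 1 not in states:
        raise ValueError("state sequence must contain a speech state")
    if num_frames < num_states:
        raise ValueError("need at least one frame per state")

    def emit(t, s):
        # Log-probability of frame t under state s, clamped to avoid log(0).
        p = min(max(speech_probs[t], eps), 1.0 - eps)
        return math.log(p if states[s] == 1 else 1.0 - p)

    neg_inf = -math.inf
    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = emit(0, 0)  # the alignment must start in the first state

    for t in range(1, num_frames):
        for s in range(num_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            if best > neg_inf:
                score[t][s] = best + emit(t, s)

    # Backtrace from the final state at the last frame.
    path = [0] * num_frames
    s = num_states - 1
    for t in range(num_frames - 1, -1, -1):
        path[t] = s
        s = back[t][s]

    last_speech = max(t for t in range(num_frames) if states[path[t]] == 1)
    return last_speech + 1


# Usage: silence, speech, silence over eight frames.
probs = [0.1, 0.2, 0.9, 0.8, 0.9, 0.3, 0.1, 0.1]
eos = estimate_eos_frame(probs, [0, 1, 0])
```

Multiplying the returned frame index by the frame shift (for example 10 ms, also an assumption here) gives an EOS timestamp. Because only two state types are scored, the alignment depends on the speech/non-speech pattern of the hypothesis rather than on its exact phoneme content.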
Keywords
Endpoint detection, forced alignment, Viterbi algorithm, speech recognition