Towards end-to-end Speaker Diarization with Generalized Neural Speaker Clustering

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)(2022)

引用 6|浏览66
暂无评分
摘要
Speaker diarization consists of many components, e.g., front-end processing, speech activity detection (SAD), overlapped speech detection (OSD) and speaker segmentation/clustering. Conventionally, most of the involved components are separately developed and optimized. The resulting speaker diarization systems are complicated and sometimes lack of satisfying generalization capabilities. In this study, we present a novel speaker diarization system, with a generalized neural speaker clustering module as the backbone. The whole system can be simplified to contain only two major parts, a speaker embedding extractor followed by a clustering module. Both parts are implemented with neural networks. In the training phase, an on-the-fly spoken dialogue generator is designed to provide the system with audio streams and the corresponding annotations in categories of non-speech, overlapped speech and active speakers. The chunk-wise inference and a speaker verification based tracing module are conducted to handle the arbitrary number of speakers. We demonstrate that the proposed speaker diarization system is able to integrate SAD, OSD and speaker segmentation/clustering, and yield competitive results in the VoxConverse20 benchmarks.
更多
查看译文
关键词
speaker diarization,speaker embedding,speaker clustering,overlap speech detection
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要