Multi-Resolution Multi-Head Attention in Deep Speaker Embedding

2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

Abstract
Pooling is an essential component for capturing long-term speaker characteristics in speaker recognition. This paper proposes simple but effective pooling methods that compute attentive weights for better temporal aggregation over variable-length input speech, enabling the end-to-end neural network to better discriminate among speakers. In particular, we observe that using multiple heads for attentive pooling over the entire encoded sequence, a method we term global multi-head attention, significantly improves performance over various pooling methods, including the recently proposed multi-head attention [1]. To improve the diversity of the attention heads, we further propose multi-resolution multi-head attention for pooling, which introduces an additional temperature hyperparameter for each head. This yields an even larger performance gain on top of that achieved with multiple heads. On the benchmark VoxCeleb1 dataset, the proposed method achieves state-of-the-art performance with an Equal Error Rate (EER) of 3.966%. Our analysis shows that using multiple heads, and giving these heads multiple resolutions via different temperatures, leads to improved certainty of the attentive weights in the new state-of-the-art system.
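To make the pooling mechanism concrete, the following is a minimal PyTorch sketch of multi-head attentive pooling with a per-head temperature, in the spirit of the method described above. The score parameterization (one learned scoring vector per head), the head layout (splitting the feature dimension across heads), and the geometric temperature schedule are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionMultiHeadPooling(nn.Module):
    """Attentive pooling with several heads, where each head's softmax
    uses its own temperature so heads attend at different resolutions.
    Sketch only: the exact parameterization is an assumption."""

    def __init__(self, feat_dim, num_heads=8, temperatures=None):
        super().__init__()
        assert feat_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = feat_dim // num_heads
        # One learned scoring vector per head (hypothetical parameterization).
        self.score = nn.Parameter(torch.randn(num_heads, self.head_dim) * 0.01)
        # Fixed per-head temperatures; geometric spacing is an assumption.
        if temperatures is None:
            temperatures = [2.0 ** k for k in range(num_heads)]
        self.register_buffer("temps", torch.tensor(temperatures).float())

    def forward(self, x):
        # x: (batch, time, feat_dim) frame-level encoder outputs.
        B, T, _ = x.shape
        h = x.view(B, T, self.num_heads, self.head_dim)   # split into heads
        # Unnormalized attention score for each frame and head.
        e = (h * self.score).sum(dim=-1)                  # (B, T, heads)
        # The temperature controls sharpness: a small value gives peaky
        # weights, a large value approaches uniform averaging over time.
        w = F.softmax(e / self.temps, dim=1)              # (B, T, heads)
        # Weighted sum over time per head, then concatenate the heads.
        pooled = (w.unsqueeze(-1) * h).sum(dim=1)         # (B, heads, head_dim)
        return pooled.reshape(B, -1)                      # (B, feat_dim)

# Usage: pool 200 frames of 512-dim encoder output into utterance vectors.
pool = MultiResolutionMultiHeadPooling(feat_dim=512, num_heads=8)
emb = pool(torch.randn(4, 200, 512))   # -> shape (4, 512)
```

Setting all temperatures to 1 and using a single head would reduce this to plain attentive pooling; the per-head temperatures are what give the heads their different resolutions.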
Keywords
multi-head attention, global multi-head attention, multi-resolution multi-head attention, deep speaker embedding, speaker recognition