Multi-Resolution Multi-Head Attention in Deep Speaker Embedding

2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

Abstract
Pooling is an essential component for capturing long-term speaker characteristics in speaker recognition. This paper proposes simple but effective pooling methods that compute attentive weights for better temporal aggregation over variable-length input speech, enabling the end-to-end neural network to better discriminate among speakers. In particular, we observe that using multiple heads for attentive pooling over the entire encoded sequence, a method we term global multi-head attention, significantly improves performance over various pooling methods, including the recently proposed multi-head attention [1]. To improve the diversity of the attention heads, we further propose multi-resolution multi-head attention for pooling, which introduces an additional temperature hyperparameter for each head. This yields an even larger performance gain on top of that achieved with multiple heads. On the benchmark VoxCeleb1 dataset, the proposed method achieves state-of-the-art performance with an Equal Error Rate (EER) of 3.966%. Our analysis shows that using multiple heads, and giving these heads multiple resolutions via different temperatures, leads to improved certainty of the attentive weights in the new state-of-the-art system.
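To make the pooling mechanism concrete, the following is a minimal PyTorch sketch of multi-head attentive pooling with a per-head temperature, in the spirit of the method described above. The score parameterization (one learned scoring vector per head), the head layout (splitting the feature dimension across heads), and the geometric temperature schedule are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionMultiHeadPooling(nn.Module):
    """Attentive pooling with several heads, where each head's softmax
    uses its own temperature so heads attend at different resolutions.
    Sketch only: the exact parameterization is an assumption."""

    def __init__(self, feat_dim, num_heads=8, temperatures=None):
        super().__init__()
        assert feat_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = feat_dim // num_heads
        # One learned scoring vector per head (hypothetical parameterization).
        self.score = nn.Parameter(torch.randn(num_heads, self.head_dim) * 0.01)
        # Fixed per-head temperatures; geometric spacing is an assumption.
        if temperatures is None:
            temperatures = [2.0 ** k for k in range(num_heads)]
        self.register_buffer("temps", torch.tensor(temperatures).float())

    def forward(self, x):
        # x: (batch, time, feat_dim) frame-level encoder outputs.
        B, T, _ = x.shape
        h = x.view(B, T, self.num_heads, self.head_dim)   # split into heads
        # Unnormalized attention score for each frame and head.
        e = (h * self.score).sum(dim=-1)                  # (B, T, heads)
        # The temperature controls sharpness: a small value gives peaky
        # weights, a large value approaches uniform averaging over time.
        w = F.softmax(e / self.temps, dim=1)              # (B, T, heads)
        # Weighted sum over time per head, then concatenate the heads.
        pooled = (w.unsqueeze(-1) * h).sum(dim=1)         # (B, heads, head_dim)
        return pooled.reshape(B, -1)                      # (B, feat_dim)

# Usage: pool 200 frames of 512-dim encoder output into utterance vectors.
pool = MultiResolutionMultiHeadPooling(feat_dim=512, num_heads=8)
emb = pool(torch.randn(4, 200, 512))   # -> shape (4, 512)
```

Setting all temperatures to 1 and using a single head would reduce this to plain attentive pooling; the per-head temperatures are what give the heads their different resolutions.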
Keywords
multi-head attention, global multi-head attention, multi-resolution multi-head attention, deep speaker embedding, speaker recognition