Learning Discriminative Features for Speaker Identification and Verification

19th Annual Conference of the International Speech Communication Association (INTERSPEECH 2018), Vols 1-6: Speech Research for Emerging Markets in Multilingual Societies (2018)

Citations: 67 | Views: 3
Abstract
The success of any Text-Independent Speaker Identification and/or Verification system relies upon the system's capability to learn discriminative features. In this paper we propose a Convolutional Neural Network (CNN) architecture based on the popular Very Deep VGG [1] CNNs, with key modifications to accommodate variable-length spectrogram inputs, reduce the model's disk space requirements, and reduce the number of parameters, resulting in a significant reduction in training times. We also propose a unified deep learning system for both Text-Independent Speaker Recognition and Speaker Verification, by training the proposed network architecture under the joint supervision of Softmax loss and Center loss [2] to obtain highly discriminative deep features that are suited for both Speaker Identification and Verification tasks. We use the recently released VoxCeleb dataset [3], which contains hundreds of thousands of real-world utterances of over 1200 celebrities belonging to various ethnicities, for benchmarking our approach. Our best CNN model achieved a Top-1 accuracy of 84.6%, a 4% absolute improvement over VoxCeleb's approach, whereas training in conjunction with Center loss improved the Top-1 accuracy to 89.5%, a 9% absolute improvement over VoxCeleb's approach.
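The joint supervision the abstract describes combines a standard softmax cross-entropy loss with a weighted center loss [2], which penalizes the distance between each deep feature and the running center of its class. The sketch below illustrates that combination in NumPy; the function names and the weight `lam` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Numerically stable softmax cross-entropy, averaged over the batch.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def center_loss(features, labels, centers):
    # Half the mean squared distance between each deep feature
    # and the center of its speaker class (as in [2]).
    diffs = features - centers[labels]
    return 0.5 * (diffs ** 2).sum(axis=1).mean()

def joint_loss(logits, features, labels, centers, lam=0.003):
    # Joint supervision: softmax loss plus lambda-weighted center loss.
    # lam is a hypothetical balancing weight, tuned in practice.
    return softmax_cross_entropy(logits, labels) + lam * center_loss(features, labels, centers)

# Toy usage: 2 utterances, 2 speakers, 2-dimensional deep features.
logits = np.array([[2.0, 0.5], [0.1, 3.0]])
features = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = joint_loss(logits, features, labels, centers)
```

In training, the class centers are themselves updated each mini-batch (the center loss gradient pulls features toward their centers and nudges the centers toward the batch mean of their class), which is what drives intra-class compactness on top of the inter-class separation the softmax loss provides.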
Keywords
speaker identification, speaker recognition, speaker verification, convolutional neural network, discriminative feature learning