Improving Deep CNN Architectures with Variable-Length Training Samples for Text-Independent Speaker Verification.

Interspeech (2021)

Cited by 4 | Viewed 6
Abstract
Deep Convolutional Neural Network (CNN) based speaker embeddings, such as r-vectors, have shown great success in the text-independent speaker verification (TI-SV) task. However, previous deep CNN models are usually trained on fixed-length samples but extract speaker embeddings from variable-length utterances, which creates a mismatch between training and embedding. To address this issue, we investigate the effect of employing variable-length training samples in CNN-based TI-SV systems and explore two approaches that improve the performance of deep CNN architectures on TI-SV by capturing variable-term contexts. First, we present an improved selective kernel convolution that allows the network to adaptively switch between short-term and long-term contexts for variable-length utterances. Second, we propose a multi-scale statistics pooling method that aggregates multiple time-scale features from different layers of the network. We build a novel ResNet34-based architecture incorporating both proposed approaches. Experiments conducted on the VoxCeleb datasets demonstrate that the effect of using variable-length samples differs across networks, and that the architecture with the two proposed approaches achieves a significant improvement over the r-vector baseline system.
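The abstract does not spell out the pooling details, but the core idea of statistics pooling — and why it resolves the fixed-length/variable-length mismatch — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the `multiscale_statistics_pooling` helper and its layer shapes are hypothetical stand-ins for the multi-scale aggregation described above.

```python
import numpy as np

def statistics_pooling(frames: np.ndarray) -> np.ndarray:
    """Aggregate variable-length frame-level features (shape [T, D])
    into a fixed-size utterance-level vector (shape [2*D]) by
    concatenating the per-dimension mean and standard deviation."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

def multiscale_statistics_pooling(layer_outputs) -> np.ndarray:
    """Hypothetical multi-scale variant: pool feature maps taken from
    several network layers (each [T_i, D_i], possibly at different
    time resolutions) and concatenate the pooled vectors."""
    return np.concatenate([statistics_pooling(f) for f in layer_outputs])

# Utterances of different duration map to embeddings of the same size,
# so training and embedding extraction can use different lengths.
short = statistics_pooling(np.random.randn(37, 8))   # shape (16,)
long_ = statistics_pooling(np.random.randn(200, 8))  # shape (16,)
```

Because the pooled dimension depends only on the channel count `D`, not on the number of frames `T`, the same network can be trained on variable-length samples and still produce fixed-size speaker embeddings.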
Keywords
speaker verification,CNN,selective kernel,multi-scale aggregation,statistics pooling