Two Stage Audio-Video Speech Separation using Multimodal Convolutional Neural Networks

2019 Sensor Signal Processing for Defence Conference (SSPD)(2019)

Abstract
The performance of audio-only neural-network-based monaural speech separation methods is still limited, particularly when multiple speakers are active. A very recent method [1] used an audio-video (AV) model to find the non-linear relationship between the noisy mixture and the desired speech signal. However, over-fitting often occurs when the AV model is trained, which limits the separation performance. To address this limitation, we propose a system with two sequentially trained AV models to separate the desired speech signal. In the proposed system, after the first AV model is trained, its output is used to calculate the training target of the second AV model, which is exploited to further improve the separation performance. The GRID audiovisual sentence corpus is used to generate the training and testing datasets. Signal-to-distortion ratio (SDR) and short-time objective intelligibility (STOI) results show that the proposed system outperforms the state-of-the-art method.
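The two-stage idea described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's method: the linear least-squares "models", the residual training target, and all variable names are assumptions standing in for the convolutional AV networks and target computation the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for spectrogram frames: mixture -> clean speech.
n_frames, n_feat = 200, 16
clean = rng.standard_normal((n_frames, n_feat))
noise = 0.5 * rng.standard_normal((n_frames, n_feat))
mixture = clean + noise

def fit_linear(x, y):
    """Least-squares stand-in for training one AV model (assumption)."""
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w

# Stage 1: train the first model on (mixture -> clean speech).
w1 = fit_linear(mixture, clean)
est1 = mixture @ w1

# Stage 2: the first model's output defines the second model's training
# target (here, the residual still missing from est1 -- an assumption).
residual_target = clean - est1
w2 = fit_linear(est1, residual_target)
est2 = est1 + est1 @ w2

err1 = np.mean((clean - est1) ** 2)  # stage-1 separation error
err2 = np.mean((clean - est2) ** 2)  # stage-2 (refined) error
```

Because the second stage is fit to whatever the first stage left unexplained, its error can only match or improve on the first stage in this toy setup.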
Keywords
speech separation, mapping relation, AV model, sequentially trained AV models