
Unifying One-Shot Voice Conversion and Cloning with Disentangled Speech Representations

IEEE International Conference on Acoustics, Speech, and Signal Processing (2024)

Abstract
We propose unifying one-shot voice conversion and cloning into a single model that can be end-to-end optimized. To achieve this, we introduce a novel extension to a speech variational auto-encoder (VAE) that disentangles speech into content and speaker representations. Instead of using a fixed Gaussian prior as in the vanilla VAE, we incorporate a learnable text-aware prior as an informative guide for learning the content representation. This results in a content representation with reduced speaker information and more accurate linguistic information. The proposed model can sample the content representation using either the posterior conditioned on speech or the text-aware prior with textual input, enabling one-shot voice conversion and cloning, respectively. Experiments show that the proposed method achieves better or comparable overall performance for one-shot voice conversion and cloning compared to state-of-the-art voice conversion and cloning methods.
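The abstract describes the architecture only at a high level. The sketch below is a minimal PyTorch illustration, not the paper's implementation, of the core idea: a VAE whose content latent can be drawn either from a speech-conditioned posterior (one-shot voice conversion) or from a learnable text-aware prior (one-shot voice cloning), then decoded together with a speaker embedding from a single reference utterance. Every module, name, and dimension (TextAwareDisentangledVAE, prior_net, the GRU/MLP choices) is an illustrative assumption, and the training objective (reconstruction loss plus a KL term between the posterior and the text-aware prior) is omitted.

# Minimal sketch under the assumptions stated above; not the authors' code.
import torch
import torch.nn as nn


class TextAwareDisentangledVAE(nn.Module):
    def __init__(self, mel_dim=80, text_vocab=100, content_dim=64, speaker_dim=128):
        super().__init__()
        # Posterior encoder q(z_content | speech): frame-level content statistics.
        self.content_encoder = nn.GRU(mel_dim, 2 * content_dim, batch_first=True)
        # Learnable text-aware prior p(z_content | text), replacing a fixed N(0, I).
        self.text_embedding = nn.Embedding(text_vocab, 128)
        self.prior_net = nn.GRU(128, 2 * content_dim, batch_first=True)
        # Speaker encoder: one utterance-level embedding from the one-shot reference.
        self.speaker_encoder = nn.Sequential(
            nn.Linear(mel_dim, speaker_dim), nn.ReLU(), nn.Linear(speaker_dim, speaker_dim)
        )
        # Decoder reconstructs mel frames from content latents + speaker embedding.
        self.decoder_rnn = nn.GRU(content_dim + speaker_dim, 256, batch_first=True)
        self.decoder_out = nn.Linear(256, mel_dim)

    @staticmethod
    def reparameterize(stats):
        # Split concatenated (mu, log-variance) and sample with the reparameterization trick.
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

    def speaker_embedding(self, reference_mel):
        # Mean-pool frames, then project to a fixed-size speaker vector.
        return self.speaker_encoder(reference_mel.mean(dim=1))

    def forward(self, source_mel=None, text_ids=None, reference_mel=None):
        spk = self.speaker_embedding(reference_mel)
        if source_mel is not None:
            # One-shot voice conversion: content sampled from the speech posterior.
            stats, _ = self.content_encoder(source_mel)
        else:
            # One-shot voice cloning: content sampled from the text-aware prior.
            stats, _ = self.prior_net(self.text_embedding(text_ids))
        z, mu, logvar = self.reparameterize(stats)
        spk_expanded = spk.unsqueeze(1).expand(-1, z.size(1), -1)
        hidden, _ = self.decoder_rnn(torch.cat([z, spk_expanded], dim=-1))
        return self.decoder_out(hidden), mu, logvar


# Usage: convert a source utterance into the reference speaker's voice,
# or synthesize speech for a token sequence in that same voice.
model = TextAwareDisentangledVAE()
src = torch.randn(1, 100, 80)          # source utterance (mel-spectrogram frames)
ref = torch.randn(1, 120, 80)          # one-shot reference utterance of the target speaker
txt = torch.randint(0, 100, (1, 20))   # token IDs of the text to clone
converted, _, _ = model(source_mel=src, reference_mel=ref)  # one-shot voice conversion
cloned, _, _ = model(text_ids=txt, reference_mel=ref)       # one-shot voice cloning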
Keywords
Voice conversion, voice cloning, VAE, speech disentanglement