Accent Conversion using Pre-trained Model and Synthesized Data from Voice Conversion

Conference of the International Speech Communication Association (INTERSPEECH), 2022

Abstract
Accent conversion (AC) aims to generate synthetic audio by changing the pronunciation patterns and prosody of source speakers (in source audio) while preserving voice quality and linguistic content. Because no parallel corpus exists that contains pairs of utterances with the same content, spoken by the same speakers, in different accents, we synthesize one as training input. The training pipeline consists of two steps. First, a voice conversion (VC) model is constructed to synthesize a training dataset containing pairs of utterances in the same voice but in two different accents. Second, an AC model is trained on the synthesized data to convert source-accented speech to target-accented speech. Given the recognized success of self-supervised speech representation learning (wav2vec 2.0) on speech problems such as VC, speech recognition, speech translation, and speech-to-speech translation, we adopt this architecture, with some customization, to train the AC model in the second step. With just 9 hours of synthesized training data, an encoder initialized with the weights of the pre-trained wav2vec 2.0 model outperforms an LSTM-based encoder.
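The abstract does not give implementation details, but the key idea of step two is to warm-start a seq2seq AC encoder from pre-trained wav2vec 2.0 weights. Below is a minimal sketch of that idea, assuming the Hugging Face `facebook/wav2vec2-base` checkpoint as the encoder and a hypothetical LSTM decoder that predicts mel-spectrogram frames; the decoder design, model names, and dimensions are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch: seq2seq AC model whose encoder is initialized from pre-trained
# wav2vec 2.0 weights. The decoder here is a placeholder assumption; the
# paper only states that the encoder is initialized from wav2vec 2.0.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AccentConversionModel(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 512):
        super().__init__()
        # Encoder weights come from the pre-trained checkpoint, rather
        # than random initialization (the LSTM-encoder baseline).
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        enc_dim = self.encoder.config.hidden_size  # 768 for the base model
        # Illustrative decoder: maps encoder frames to target-accent mels.
        self.decoder = nn.LSTM(enc_dim, hidden, batch_first=True)
        self.mel_out = nn.Linear(hidden, n_mels)

    def forward(self, source_wave: torch.Tensor) -> torch.Tensor:
        # source_wave: (batch, samples), 16 kHz raw audio in the source accent
        enc = self.encoder(source_wave).last_hidden_state  # (batch, frames, enc_dim)
        dec, _ = self.decoder(enc)
        return self.mel_out(dec)  # predicted target-accent mel frames

model = AccentConversionModel()
mel = model(torch.randn(1, 16000))       # one second of dummy audio
print(mel.shape)                          # torch.Size([1, 49, 80]), ~20 ms frame stride
```

In this setup, training would minimize a reconstruction loss between the predicted mels and those of the target-accent utterance from the VC-synthesized parallel pair; the pre-trained encoder is what lets the model learn from only ~9 hours of such data.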
Keywords
Accent Conversion, Voice Conversion, seq2seq, wav2vec 2.0