Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis

IEEE/ACM Transactions on Audio, Speech, and Language Processing (2022)

Abstract
Text-to-speech (TTS) systems convert text into speech and are expected to synthesize natural, expressive speech, adapt to an unseen voice, and capture the speaking style of an unseen speaker. Introducing an unseen speaker's speaking style into a TTS system opens up a wide range of applications, including personal assistants, news broadcasting, and audio navigation. Speaking style varies from person to person and is shaped by language, demography, culture, and other factors; it is best captured by the prosody of the signal. Producing high-quality multi-speaker speech that accounts for prosody in a zero-shot manner is an active research area with numerous real-world applications, and despite several efforts it remains a difficult open problem. In this paper, we present a novel zero-shot multi-speaker speech synthesis approach (ZSM-SS) that combines a normalization architecture and a speaker encoder with a non-autoregressive, multi-head-attention-driven encoder-decoder architecture. Given an input text and a reference speech sample of an unseen person, ZSM-SS can generate speech in that person's style in a zero-shot manner. Additionally, we demonstrate how the affine parameters of normalization help capture prosodic features such as energy and fundamental frequency in a disentangled fashion, and how they can be used to generate morphed speech output. We generate the 256-dimensional speaker embedding using a speaker encoder based on the wav2vec 2.0 architecture. We demonstrate the efficacy of the proposed architecture on the multi-speaker VCTK [1] and LibriTTS [2] datasets using visualization of the Hessian of the proposed model, multiple quantitative metrics that measure distortion of the generated speech, mean opinion score (MOS), and an analysis of the speaker embeddings produced by the proposed speaker encoder.
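The abstract does not spell out how the normalization layers are conditioned on the speaker. One common way to realize speaker-dependent affine parameters is conditional layer normalization, where the scale and shift are linear projections of the speaker embedding. The sketch below illustrates that general idea only; the function name, the projection matrices `w_gamma` and `w_beta`, and the hidden size are assumptions for illustration, not details taken from ZSM-SS.

```python
import numpy as np

def conditional_layer_norm(hidden, speaker_emb, w_gamma, w_beta, eps=1e-5):
    """Layer norm whose affine parameters depend on a speaker embedding.

    hidden:      (T, D) sequence of hidden states
    speaker_emb: (E,)   speaker embedding (E = 256 in the abstract)
    w_gamma:     (E, D) hypothetical projection producing the scale offset
    w_beta:      (E, D) hypothetical projection producing the shift
    """
    # Normalize each time step over the feature dimension.
    mean = hidden.mean(axis=-1, keepdims=True)
    var = hidden.var(axis=-1, keepdims=True)
    normalized = (hidden - mean) / np.sqrt(var + eps)

    # Speaker-dependent affine parameters (assumed form: 1 + projection).
    gamma = 1.0 + speaker_emb @ w_gamma  # shape (D,)
    beta = speaker_emb @ w_beta          # shape (D,)
    return gamma * normalized + beta

rng = np.random.default_rng(0)
T, D, E = 5, 8, 256  # D is an arbitrary illustrative hidden size
hidden = rng.normal(size=(T, D))
speaker_emb = rng.normal(size=E)
w_gamma = rng.normal(scale=0.01, size=(E, D))
w_beta = rng.normal(scale=0.01, size=(E, D))

out = conditional_layer_norm(hidden, speaker_emb, w_gamma, w_beta)

# With zero projections the layer reduces to plain layer normalization.
zero = np.zeros((E, D))
plain = conditional_layer_norm(hidden, speaker_emb, zero, zero)
```

Because the projections in this sketch are linear, blending two speaker embeddings before the projection blends the resulting scale and shift in the same proportion. Whether this matches the morphing procedure used in the paper is not stated in the abstract.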
Keywords
Speech, Adaptation models, Speech synthesis, Training, Decoding, Transformers, Analytical models, Multispeaker text-to-speech, normalization, speaker encoding, zero shot speech synthesis