Google's Next-Generation Real-Time Unit-Selection Synthesizer Using Sequence-to-Sequence LSTM-Based Autoencoders

18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), Vols 1-6: Situated Interaction (2017)

Abstract
A neural network model that significantly improves unit-selection-based text-to-speech (TTS) synthesis is presented. The model employs a sequence-to-sequence LSTM-based autoencoder that compresses the acoustic and linguistic features of each unit into a fixed-size vector, referred to as an embedding. Unit selection is facilitated by formulating the target cost as an L2 distance in the embedding space. In open-domain speech synthesis the method achieves a 0.2 improvement in mean opinion score (MOS), while for limited-domain synthesis it reaches the cap of 4.5 MOS. Furthermore, the new TTS system halves the quality gap between the previous unit-selection system and WaveNet while retaining low computational cost and latency.
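The target cost described above can be illustrated with a minimal sketch. It assumes that a target embedding and a set of candidate unit embeddings are already available as plain lists of floats; the function names and the 3-dimensional example vectors are hypothetical and are not taken from the paper. Only the target cost is shown, so anything else a full unit-selection search would weigh (for example, how well adjacent units join) is left out.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_units_by_target_cost(
    target: Sequence[float],
    candidates: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (candidate index, target cost) pairs, lowest cost first."""
    costs = [(i, l2_distance(target, c)) for i, c in enumerate(candidates)]
    return sorted(costs, key=lambda pair: pair[1])


# Hypothetical embeddings, for illustration only (not real model output).
target_embedding = [0.0, 0.0, 0.0]
candidate_embeddings = [
    [3.0, 4.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
]

ranking = rank_units_by_target_cost(target_embedding, candidate_embeddings)
best_index, best_cost = ranking[0]
print(f"closest unit: {best_index}, target cost: {best_cost:.2f}")
```

Because every unit is reduced to a fixed-size vector, comparing a target against a candidate costs one distance computation regardless of the unit's duration.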
Keywords
text-to-speech synthesis, LSTM, unit-selection