Unsupervised Clustering of Emotion and Voice Styles for Expressive TTS

Florian Eyben, Sabine Buchholz, Norbert Braunschweiler, Javier Latorre, Vincent Wan, Mark J. F. Gales, Kate Knill

ICASSP (2012)

Abstract

Current text-to-speech synthesis (TTS) systems are often perceived as lacking expressiveness, limiting the ability to fully convey information. This paper describes initial investigations into improving expressiveness for statistical speech synthesis systems. Rather than using hand-crafted definitions of expressive classes, an unsupervised clustering approach is described which is scalable to large quantities of training data. To incorporate this "expression cluster" information into an HMM-TTS system, two approaches are described: cluster questions in the decision tree construction, and average expression speech synthesis (AESS) using cluster-based linear transform adaptation. The performance of the approaches was evaluated on audiobook data in which the reader exhibits a wide range of expressiveness. A subjective listening test showed that synthesising with AESS results in speech that better reflects the expressiveness of human speech than a baseline expression-independent system.

Keywords

Expressive synthesis, text-to-speech, unsupervised clustering, Average Voice Model, HMM-TTS
