
Frame labeling and mapping for non-parallel voice conversion

2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP) (2017)

Voice conversion converts one person's voice into another person's voice. Depending on whether the source and target speakers utter the same content, there are two types of conversion: parallel and non-parallel voice conversion. In parallel voice conversion, the two speakers' speech data have the same content, so alignment methods can readily establish the correspondence between them. When methods designed for parallel voice conversion are applied to non-parallel voice conversion, mapping the corresponding signal segments is not straightforward. Recently, we proposed using a DNN-HMM (hybrid Deep Neural Network - Hidden Markov Model) recognizer to label each frame of the speech data from both the source and target speakers, and establishing the mapping by clustering the pseudo-likelihood vector of each frame. The experiments showed that this method generates results comparable to those of a parallel voice conversion method. In this work, we further study how the method behaves under different settings in the frame mapping process. Using an exemplar-based parallel voice conversion method for testing, we compare our method with the state-of-the-art method INCA (an Iterative combination of a Nearest neighbor search step and a Conversion step Alignment method). The experiments show that the proposed method generates results similar to those generated by INCA-based voice conversion.
Keywords: non-parallel voice conversion, frame mapping, DNN-HMM recognizer, clustering
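
The frame mapping idea in the abstract (cluster per-frame pseudo-likelihood vectors, then pair source and target frames through the clusters) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the use of k-means, the number of clusters, the rule of pairing each source frame with its nearest target frame in the same cluster, the hypothetical function names `kmeans` and `map_frames`, and the synthetic vectors that stand in for recognizer output.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """Plain k-means (assumed clustering method); returns a label per row of x."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    # Recompute labels so they agree with the final centers.
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def map_frames(src_like, tgt_like, k):
    """Pair each source frame with the closest target frame in its cluster."""
    labels = kmeans(np.vstack([src_like, tgt_like]), k)
    src_labels = labels[:len(src_like)]
    tgt_labels = labels[len(src_like):]
    pairs = []
    for i, c in enumerate(src_labels):
        candidates = np.flatnonzero(tgt_labels == c)
        if candidates.size == 0:
            continue  # no target frame shares this cluster; leave unmapped
        d = ((tgt_like[candidates] - src_like[i]) ** 2).sum(axis=1)
        pairs.append((i, int(candidates[d.argmin()])))
    return pairs

# Synthetic stand-ins for pseudo-likelihood vectors over 3 recognizer states.
rng = np.random.default_rng(1)

def fake_frames(states, n_states=3):
    x = np.full((len(states), n_states), 0.05)
    x += rng.uniform(0, 0.01, (len(states), n_states))
    x[np.arange(len(states)), states] += 0.9  # peak on the labeled state
    return x

src_states = np.array([0, 0, 1, 1, 2, 2])
tgt_states = np.array([2, 1, 0, 2, 0, 1, 1])
src = fake_frames(src_states)
tgt = fake_frames(tgt_states)
pairs = map_frames(src, tgt, k=3)
print(pairs)  # list of (source_frame_index, target_frame_index)
```

In a real system the resulting index pairs would select the spectral feature frames used to train a conversion model; the pseudo-likelihood vectors serve only to decide which frames correspond.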