Improving the Performance of HMM-Based Voice Conversion Using Context Clustering Decision Tree and Appropriate Regression Matrix Format

Long Qin, Yi-jian Wu, Zhen-Hua Ling, Ren-Hua Wang

iFLYTEK Speech Laboratory, University of Science and Technology of China, Hefei, P.R. China
qinlong@mail.ustc.edu.cn, jasonwu@mail.ustc.edu.cn, zhling@ustc.edu, rhw@ustc.edu.cn

Abstract

To improve the performance of an HMM-based voice conversion system in which LSP coefficients are used as the spectral representation, this paper adopts a model clustering technique that ties HMMs into classes for model adaptation according to the phonetic and linguistic contextual factors of the HMMs. In addition, because LSP coefficients of adjacent orders are correlated, an appropriate regression matrix format is suggested to suit the small amount of adaptation training data. Subjective and objective tests show that the source HMMs are adapted more accurately with the proposed method, and that the speech synthesized from the adapted models has better speaker discrimination and speech quality.

Index Terms: model adaptation, regression matrix clustering, regression matrix format

1. Introduction

With the development of corpus-based speech synthesis techniques, the intelligibility and naturalness of synthetic speech have improved considerably. However, it is still difficult for a corpus-based TTS system to synthesize speech of various speakers and speaking styles from a limited database. The voice conversion technique, which converts one speaker's voice into another speaker's voice, therefore provides a practical approach to synthesizing speech of multiple speakers. The HMM-based voice conversion system is built on the basis of HMM-based speech synthesis, in which spectrum, pitch and duration are modeled simultaneously in a unified framework of HMMs [1][2][3].
In addition, the voice characteristics of the synthetic speech can be converted from one speaker to another by applying a model adaptation algorithm, such as MLLR (maximum likelihood linear regression) [4][5], with a small amount of speech uttered by the target speaker.

We have built an HMM-based speech synthesis system that uses LSP (line spectral pair) coefficients and the STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weighted spectral contour) analysis-synthesis algorithm [6][7]. By implementing the MLLR algorithm, we gave this system the ability to synthesize the voices of various speakers. However, two main problems remain in the HMM-based voice conversion system. First, the data-driven clustering method used in the MLLR algorithm ignores many contextual factors of the HMMs, so unrelated HMMs may be forced into one class, which reduces the accuracy of the model adaptation. Second, the system performance, including the voice characteristics and voice quality of the synthetic speech, degrades greatly when the adaptation training data is very limited.

To solve these problems, this paper describes a clustering method that captures the phonetic and linguistic connections between HMMs using a context decision tree, an approach that has been applied similarly in both HMM-based speech recognition and HMM-based speech synthesis [8][9]. Moreover, since only LSP coefficients of a few adjacent orders are strongly correlated, an appropriate regression matrix format is suggested for the case where very little training data is available.

The rest of this paper is organized as follows. Section 2 gives an overview of our HMM-based voice conversion system. Section 3 describes the proposed context clustering decision tree and the appropriate regression matrix format for model adaptation.
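To illustrate the idea behind the restricted regression matrix format (this is a minimal sketch, not the paper's implementation; the function names, the least-squares estimation, and the bandwidth value are all illustrative assumptions), the following Python example estimates and applies an MLLR-style affine transform of Gaussian mean vectors in which the regression matrix is constrained to a band around the diagonal, so each adapted LSP coefficient depends only on a few adjacent orders:

```python
import numpy as np

def band_mask(dim, bandwidth):
    """Boolean mask keeping only entries within `bandwidth` of the diagonal."""
    idx = np.arange(dim)
    return np.abs(idx[:, None] - idx[None, :]) <= bandwidth

def estimate_banded_mllr(source_means, target_means, bandwidth=1):
    """Least-squares estimate of a banded matrix A and bias b such that
    target ≈ A @ source + b, solved row by row.  Each row uses only the
    source coefficients inside the band, which sharply reduces the number
    of free parameters when adaptation data is scarce."""
    dim = source_means.shape[1]
    mask = band_mask(dim, bandwidth)
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for i in range(dim):
        cols = np.where(mask[i])[0]
        # design matrix: the selected source coefficients plus a bias column
        X = np.hstack([source_means[:, cols],
                       np.ones((len(source_means), 1))])
        w, *_ = np.linalg.lstsq(X, target_means[:, i], rcond=None)
        A[i, cols] = w[:-1]
        b[i] = w[-1]
    return A, b

def adapt_means(means, A, b):
    """Apply the MLLR-style mean transform mu_hat = A @ mu + b."""
    return means @ A.T + b
```

With a band of half-width k, each row of A has at most 2k + 1 free entries instead of D, so far fewer target-speaker frames are needed to estimate the transform reliably.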
Section 4 presents experimental results, including subjective and objective evaluations, and section 5 concludes the paper.

2. Overview of HMM-based voice conversion

The framework of our HMM-based voice conversion system is shown in Figure 1. The system consists of three stages: the training stage, the adaptation stage, and the synthesis stage.

In the training stage, the LSP coefficients and the logarithm of the fundamental frequency are extracted by STRAIGHT analysis, and their dynamic parameters, including delta and delta-delta coefficients, are calculated. Because pitch observations are discontinuous, MSD (multi-space probability distribution) HMMs are used to model the spectrum and pitch parameters, while state durations are modeled by multi-dimensional Gaussian distributions.

In the adaptation stage, the spectrum, pitch and duration HMMs of the source speaker are all adapted to those of the target speaker. First, the spectrum and pitch HMMs are adapted by MLLR with context-decision-tree clustering. Then, using the converted spectrum and pitch HMMs, the target speaker's utterances are segmented to obtain the duration adaptation data, so that the duration models can be adapted as well.

In the synthesis stage, a sentence HMM for the given input text is constructed by concatenating the converted phoneme HMMs. From the sentence HMM, the LSP and pitch parameter sequences are obtained with the speech parameter generation algorithm, where phoneme durations are determined by the state duration distributions. Finally, the generated spectrum parameters, converted from the LSP coefficients, and the F0 sequence are fed into the STRAIGHT decoder to synthesize the target speaker's speech.

INTERSPEECH 2006 - ICSLP, September 17-21, Pittsburgh, Pennsylvania