HIGH ACCURATE MODEL-INTEGRATION-BASED VOICE CONVERSION USING DYNAMIC FEATURES AND MODEL STRUCTURE OPTIMIZATION

Daisuke Saito 1, Shinji Watanabe 2, Atsushi Nakamura 2, and Nobuaki Minematsu 1

1 The University of Tokyo, Tokyo, Japan
2 NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan
{dsk saito,mine}@gavo.t.u-tokyo.ac.jp, {watanabe,ats}@cslab.kecl.ntt.co.jp

ABSTRACT

This paper combines a parameter generation algorithm and a model optimization approach with model-integration-based voice conversion (MIVC). We previously proposed the probabilistic integration of a joint density model and a speaker model to mitigate the requirement for a parallel corpus in voice conversion (VC) based on Gaussian mixture models (GMMs). Like other VC methods, MIVC suffers from two problems: the degradation of perceptual quality caused by discontinuities in the parameter trajectory, and the difficulty of optimizing the model structure. To solve these problems, this paper proposes a parameter generation algorithm constrained by dynamic features for the first problem, and an information criterion that includes the mutual influences between the joint density model and the speaker model for the second problem. Experimental results show that the first approach improved the performance of VC and that the second approach appropriately predicted the optimal number of mixtures of the speaker model for our MIVC.

Index Terms: Voice conversion, probabilistic integration, dynamic features, information criterion

1. INTRODUCTION

Voice conversion (VC) is a technique that transforms an input utterance of one speaker into an utterance that sounds like another speaker’s voice without changing the linguistic content. VC can be regarded as a technique for modifying input features into the features of a desired target. VC techniques therefore have the potential to be applied to many areas of speech processing besides speech synthesis or speech generation [1, 2].
To derive appropriate features of a target speaker from a source speaker’s features, two functions are important: modeling the proper correspondence between the source features and the target features, and representing the feature space of the target precisely. Although several statistical approaches to voice conversion have been proposed [1, 3, 4], they focus strongly on the first function. To realize it, they require a parallel corpus for training, which contains many utterances with the same linguistic content from both the source and the target speakers. In contrast, we have proposed model-integration-based voice conversion (MIVC), which focuses not only on the first function but also on the second, i.e., modeling the feature space of the target speaker precisely [5]. Our method uses non-parallel speech data of the target speaker to construct the speaker model of the target, and thus effectively mitigates the data sparseness problem caused by the requirement for a parallel corpus.

Other approaches also focus on the efficient use of non-parallel data [6, 7]. They apply parameter adaptation techniques to the parameters of the joint density model, which is constructed to model the relation between the source and the target speakers. In contrast, our approach constructs the speaker model of the target independently and integrates it with the joint density model in a probabilistic manner. It therefore works well even when the amount of training data for the joint density model is small.

In this paper, we address two further problems in voice conversion: the degradation of the perceptual quality of converted speech caused by discontinuities in the parameter trajectory, and the difficulty of optimizing the model structure of the conversion models.
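As a point of reference for the joint density model discussed above, the following is a minimal Python sketch of conventional frame-by-frame conversion with a joint-density GMM, i.e., the posterior-weighted conditional mean E[y | x]. It is not the MIVC method of this paper. The function names (gaussian_logpdf, convert_frames), the array layout, and the use of NumPy are assumptions made for illustration, and the GMM parameters are assumed to have been trained beforehand on parallel data. Each frame is converted independently of its neighbors, which is why such a mapping can produce a discontinuous parameter trajectory.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian at each row of x (shape (T, d))."""
    d = mean.shape[0]
    diff = x - mean
    chol = np.linalg.cholesky(cov)               # cov = chol @ chol.T
    sol = np.linalg.solve(chol, diff.T)          # shape (d, T)
    maha = np.sum(sol ** 2, axis=0)              # squared Mahalanobis distances
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def convert_frames(X, weights, means, covs):
    """Frame-by-frame mapping E[y | x] under a joint-density GMM.

    Assumed (hypothetical) layout:
      X       : (T, d)       source feature frames
      weights : (M,)         mixture weights
      means   : (M, 2d)      joint means [mu_x, mu_y]
      covs    : (M, 2d, 2d)  joint covariances [[S_xx, S_xy], [S_yx, S_yy]]
    Returns converted frames of shape (T, d).
    """
    T, d = X.shape
    M = weights.shape[0]
    log_post = np.empty((T, M))
    cond_means = np.empty((M, T, d))
    for m in range(M):
        mu_x, mu_y = means[m, :d], means[m, d:]
        S_xx = covs[m, :d, :d]
        S_yx = covs[m, d:, :d]
        # Unnormalized log posterior of component m given each source frame.
        log_post[:, m] = np.log(weights[m]) + gaussian_logpdf(X, mu_x, S_xx)
        # Conditional mean: mu_y + S_yx S_xx^{-1} (x - mu_x), one row per frame.
        cond_means[m] = mu_y + np.linalg.solve(S_xx, (X - mu_x).T).T @ S_yx.T
    # Normalize posteriors in a numerically stable way.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # Posterior-weighted sum of the per-component conditional means.
    return np.einsum("tm,mtd->td", post, cond_means)
```

Note that nothing in convert_frames couples frame t to frame t+1; a trajectory-level method would have to add such a constraint, for example through dynamic features.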
The first problem is mainly caused by frame-by-frame mapping, in which the correlation between target feature vectors across frames is not considered. Our MIVC also suffers from this problem. In addition, since the parameters of the target speaker model in MIVC are independent of the feature sequence of the source speaker, inappropriate spectral movement can occur more often than in conventional VC methods, even if each frame of the converted features is modeled more precisely.

The second problem, the determination of an optimal model structure, is one of the most difficult problems in statistical acoustic modeling. For example, in conventional GMM-based voice conversion, increasing the number of Gaussian components unnecessarily degrades conversion performance on test sentences. This is well known as the over-training effect. In MIVC, optimizing the model structure is even more difficult because the mutual influences between the joint density model and the speaker model must be considered.

Several approaches to these problems have been proposed in various areas: a filter-based approach [8] and maximum likelihood estimation of the parameter trajectory [9, 10] for the first problem, and acoustic modeling based on the MDL criterion [11] or variational Bayesian treatment [12] for the second. Building on them, in this paper we employ two approaches in our method: a parameter generation algorithm using dynamic features for the first problem, and model optimization based on an information criterion that includes the mutual influences between the two models for the second. Experimental results show that the first approach improved the performance of VC and that the second appropriately predicted the optimal number of components of the speaker model for our MIVC.

2. MODEL-INTEGRATION-BASED VOICE CONVERSION

This section briefly describes the joint density GMM method [1] and our model-integration-based voice conversion (MIVC) [5]. Let $X = [x_1, x_2, \ldots, x_{n_x}]$ be a vector sequence characterizing an utterance from the source speaker, and $Y = [y_1, y_2, \ldots, y_{n_y}]$ be that
4576 978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011