HIGH ACCURATE MODEL-INTEGRATION-BASED VOICE CONVERSION
USING DYNAMIC FEATURES AND MODEL STRUCTURE OPTIMIZATION
Daisuke Saito¹, Shinji Watanabe², Atsushi Nakamura², and Nobuaki Minematsu¹
¹ The University of Tokyo, Tokyo, Japan
² NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan
{dsk saito,mine}@gavo.t.u-tokyo.ac.jp, {watanabe,ats}@cslab.kecl.ntt.co.jp
ABSTRACT
This paper combines a parameter generation algorithm and a model
optimization approach with model-integration-based voice conversion
(MIVC). We previously proposed the probabilistic integration of a
joint density model and a speaker model to relax the parallel-corpus
requirement of voice conversion (VC) based on Gaussian mixture
models (GMMs). Like other VC methods, MIVC suffers from two
problems: degradation of perceptual quality caused by discontinuities
in the parameter trajectory, and the difficulty of optimizing the
model structure. To solve them, this paper proposes a parameter
generation algorithm constrained by dynamic features for the first
problem, and an information criterion that accounts for the mutual
influences between the joint density model and the speaker model for
the second. Experimental results show that the first approach
improved VC performance and that the second appropriately predicted
the optimal number of mixture components of the speaker model for
MIVC.
Index Terms— Voice conversion, probabilistic integration, dy-
namic features, information criterion
1. INTRODUCTION
Voice conversion (VC) is a technique for transforming an input
utterance of one speaker into an utterance that sounds like another
speaker’s voice without changing the linguistic content. VC can be
regarded as a technique for modifying input features into the
features of a desired target. VC techniques therefore have the
potential to be applied to many areas of speech processing besides
speech synthesis and speech generation [1, 2].
To derive appropriate features of a target speaker from a source
speaker’s features by VC techniques, two important functions should
be considered: modeling the proper correspondence between the source
features and the target features, and representing the feature space
of the target precisely. Although several statistical approaches to
voice conversion have been proposed [1, 3, 4], they focus strongly on
the first function. To realize this function, they require a parallel
corpus for training, which contains many utterances with the same
linguistic content from both the source and the target speakers. In
contrast, we have proposed model-integration-based voice conversion
(MIVC), which focuses not only on the first function but also on the
second, i.e., modeling the feature space of the target speaker
precisely [5]. Our method uses non-parallel speech data of the target
speaker to construct the speaker model of the target. It thereby
mitigates the data sparseness problem caused by the parallel-corpus
requirement. Other approaches also focus on the efficient use of
non-parallel data [6, 7]. They apply parameter adaptation techniques
to the parameters of the joint density model, which is constructed to
model the relation between the source and the target speakers. Our
approach, by contrast, constructs the speaker model of the target
independently and integrates it with the joint density model in a
probabilistic manner. Therefore it works well even if the amount of
training data for the joint density model is small.
In this paper, we address two further problems in voice conversion:
the degradation of the perceptual quality of the converted speech
caused by discontinuities in the parameter trajectory, and the
difficulty of optimizing the structure of the conversion models. The
first problem is mainly caused by frame-by-frame mapping, in which
the correlation of the target feature vectors between frames is not
considered. Our MIVC also suffers from this problem. In addition,
since the parameters of the target speaker model in MIVC are
independent of the feature sequence of the source speaker,
inappropriate spectral movement can occur more often than in
conventional VC methods, even if each frame of the converted features
is modeled more precisely.
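The effect of frame-by-frame mapping can be illustrated with a minimal sketch. The two tracks below are made-up values for a single feature dimension, not data from this paper, and `mean_squared_step` is a hypothetical helper that simply measures how much a track jumps between consecutive frames.

```python
def mean_squared_step(track):
    """Average squared difference between consecutive frames of a 1-D track."""
    steps = [b - a for a, b in zip(track, track[1:])]
    return sum(s * s for s in steps) / len(steps)


# Hypothetical converted trajectories for one feature dimension.
# Both start at 0.0 and end at 1.0, but the second one is converted
# frame by frame without any constraint linking neighbouring frames.
smooth_track = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
jumpy_track = [0.0, 0.5, 0.1, 0.9, 0.6, 1.0]

smooth_score = mean_squared_step(smooth_track)
jumpy_score = mean_squared_step(jumpy_track)
print(smooth_score < jumpy_score)
```

A track can have plausible values in every individual frame and still score badly on such a measure, which is the kind of discontinuity that a constraint across frames is meant to suppress.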
The second problem, the determination of an optimal model structure,
is one of the most difficult problems in statistical acoustic
modeling. For example, in conventional GMM-based voice conversion,
increasing the number of Gaussian components unnecessarily degrades
conversion performance on test sentences. This is well known as the
over-training effect. In the case of MIVC, optimizing the model
structure is more difficult because the mutual influences between the
joint density model and the speaker model should be considered.
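As a point of reference, a generic information criterion of the MDL type trades off the fit to the training data against the number of free parameters. The form below is the textbook one for a single model, written under the assumption of a GMM with M diagonal-covariance components in D dimensions trained on N frames; it is not the criterion proposed in this paper, which additionally accounts for the interaction between the two models. The symbols \mathcal{D}, \hat{\theta}_m, and k_m are introduced here only for illustration.

```latex
\mathrm{MDL}(m) = -\log p\bigl(\mathcal{D} \mid \hat{\theta}_m\bigr)
                  + \frac{k_m}{2} \log N,
\qquad
k_m = (M - 1) + 2DM
```

Here the first term decreases as components are added, while the penalty term grows with k_m (M - 1 free weights, DM means, and DM variances), so the structure with the smallest value is selected.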
Several approaches to these problems have been proposed in various
areas: a filter-based approach [8] and maximum likelihood estimation
of the parameter trajectory [9, 10] for the first problem, and
acoustic modeling based on the MDL criterion [11] or variational
Bayesian treatment [12] for the second. Building on them, this paper
employs two approaches in our method: a parameter generation
algorithm using dynamic features for the first problem, and model
optimization based on an information criterion that includes the
mutual influences between the two models for the second. Experimental
results show that the first approach improved VC performance and that
the second appropriately predicted the optimal number of components
of the speaker model for our MIVC.
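The idea of generating parameters under a dynamic-feature constraint can be sketched for a single feature dimension as follows. This is a generic maximum-likelihood trajectory sketch in the spirit of [9, 10], not the algorithm of this paper. It assumes per-frame Gaussian means and variances for a static value and its delta, a delta window of 0.5 * (c[t+1] - c[t-1]) with clamped edges, and diagonal covariances; all function names are hypothetical.

```python
def build_window_matrix(num_frames):
    """Rows 2t and 2t+1 map a static track c to c[t] and its delta.

    Assumed delta window: 0.5 * (c[t+1] - c[t-1]), with edges clamped.
    """
    rows = []
    for t in range(num_frames):
        static_row = [0.0] * num_frames
        static_row[t] = 1.0
        delta_row = [0.0] * num_frames
        delta_row[min(t + 1, num_frames - 1)] += 0.5
        delta_row[max(t - 1, 0)] -= 0.5
        rows.extend([static_row, delta_row])
    return rows


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def generate_trajectory(means, variances):
    """Return the static track c maximizing the Gaussian likelihood of W c.

    means and variances are flat lists of length 2T ordered as
    [static_0, delta_0, static_1, delta_1, ...].
    """
    size = len(means)
    num_frames = size // 2
    w = build_window_matrix(num_frames)
    prec = [1.0 / v for v in variances]
    # Normal equations: (W^T P W) c = W^T P m, with P the diagonal precision.
    lhs = [[sum(w[r][i] * prec[r] * w[r][j] for r in range(size))
            for j in range(num_frames)] for i in range(num_frames)]
    rhs = [sum(w[r][i] * prec[r] * means[r] for r in range(size))
           for i in range(num_frames)]
    return solve(lhs, rhs)


# Jumpy static means, zero delta means, and confident (low-variance) deltas.
jumpy_means = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
jumpy_vars = [1.0, 0.01, 1.0, 0.01, 1.0, 0.01, 1.0, 0.01]
smoothed = generate_trajectory(jumpy_means, jumpy_vars)
print(max(smoothed) - min(smoothed) < 1.0)
```

In this toy setting the static means alternate between 0 and 1, but because the delta statistics say that the track should barely move, the generated trajectory is pulled toward a nearly flat curve. Production implementations exploit the banded structure of the normal equations instead of the dense solver used here for brevity.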
2. MODEL-INTEGRATION-BASED VOICE CONVERSION
This section briefly describes the joint density GMM method [1]
and our model-integration-based voice conversion (MIVC) [5]. Let
$X = [x_1, x_2, \ldots, x_{n_x}]$ be a vector sequence characterizing
an utterance from the source speaker, and
$Y = [y_1, y_2, \ldots, y_{n_y}]$ be that
4576 978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011