Extended Conditional GMM and Covariance Matrix Correction for Real-Time Spectral Voice Conversion

Pierre Lanchantin, Nicolas Obin, Xavier Rodet
IRCAM - CNRS-UMR9912-STMS, Analysis-Synthesis Team, 1, place Igor-Stravinsky, 75004 Paris, France
lanchant@ircam.fr, nobin@ircam.fr, rod@ircam.fr

Abstract

Gaussian mixture model (GMM)-based spectral voice conversion (VC) can be performed in real time by applying the conversion method frame by frame. However, this local method can produce inappropriate parameter trajectories, and the converted spectrum can be excessively smoothed due to the statistical approach. In order to address these limitations, we propose an approach based on a new Extended Conditional GMM model. Two different feature vectors are used to describe the source characteristics: one is specifically designed for a precise description of the spectral features to be transformed, while the other is designed for the selection of the transformations to be applied. The latter includes local descriptors of the parameter trajectories, via Discrete Cosine Transform (DCT) coefficients, in order to generate local parameter trajectories. Finally, the effect of over-smoothing is alleviated by a covariance matrix correction method. The proposed VC method is evaluated objectively and subjectively, showing a dramatic improvement over the conventional VC method.

Index Terms: Voice conversion, Extended Conditional GMM, Discrete Cosine Transform.

1. Introduction

The aim of speaker conversion - a typical application of voice conversion (VC) techniques - is to modify the speech signal of a source speaker so that it is perceived as that of a target speaker. The overall methodology for speaker conversion is to define and learn a mapping function from the acoustic features of a source speaker to those of a target speaker. Among the statistical approaches described in [1, 2, 3], one of the most popular methods, proposed by Stylianou et al.
[4], is based on a Gaussian mixture model (GMM) that defines a continuous mapping between the features of the source and target voices. Kain extended Stylianou's work by directly modelling the joint probability density of the source and target speakers' acoustic spaces [5]. This method allows the system to capture all the existing correlations between the source and target speakers' acoustic features. In most cases, the method is applied frame by frame, which makes its implementation for real-time conversion straightforward. Although this type of method is relatively efficient, conversion performance is still insufficient in terms of speech quality: the frame-by-frame conversion process induces inappropriate spectral parameter trajectories, and the converted spectrum can be excessively smoothed. Toda et al. proposed in [6] a method based on maximum likelihood estimation of parameter trajectories, which greatly improves the quality of synthesis by taking into account the dynamic features and the global variance. However, this method requires a global optimization, which can be a problem for real-time applications.

(This study was supported by FEDER Angelstudio: Générateur d'Avatars personnalisés, 2009-2011.)

In order to address these limitations, we propose the novel approach presented in this study. First, we define an Extended Conditional GMM (XcGMM) in which the mixture weights depend on an alternative representation of the source characteristics, different from the one used for the description of the spectral characteristics to be converted. This modelling allows the use of a high-resolution representation of the spectral characteristics to be transformed without necessarily increasing the complexity of the model. At the same time, it allows the inclusion of additional information for the selection of the transformations to apply.
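One such additional descriptor is a DCT-based stylization of the parameter trajectories, which keeps only the first few DCT coefficients of a local segment. The sketch below illustrates this idea; the segment length and the number of retained coefficients (`n_coef`) are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: stylizing a spectral-parameter trajectory by keeping
# only its first few DCT coefficients (the segment length and n_coef
# are illustrative assumptions).
import numpy as np
from scipy.fft import dct, idct

def stylize_trajectory(traj, n_coef=4):
    """Smooth a 1-D trajectory by truncating its DCT expansion."""
    c = dct(traj, type=2, norm="ortho")
    c[n_coef:] = 0.0                      # discard fine temporal detail
    return idct(c, type=2, norm="ortho")  # stylized local trajectory

traj = np.sin(np.linspace(0, np.pi, 20)) + 0.1 * np.random.randn(20)
smooth = stylize_trajectory(traj, n_coef=4)
```

The retained low-order coefficients act as a compact, local description of the trajectory shape, which is the kind of information the XcGMM can use when selecting transformations.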
In this way, the Discrete Cosine Transform (DCT) can be used to stylize the trajectories of the spectral parameters. These additional parameters are taken into account in order to generate local parameter trajectories. Finally, we propose a covariance matrix correction method to overcome the over-smoothing of the transformed spectral characteristics.

The paper is organized as follows: Section 2 presents the proposed approach and the related VC system; its optimization and objective evaluation are described in Section 3; finally, the subjective evaluation of the proposed approach is presented and discussed in Section 4.

2. Proposed Approach

Let Z = (X, Y) be the joint random process of source-target acoustic spectral features, in which X = {X_n}_{n∈N} and Y = {Y_n}_{n∈N} are the source and target processes, respectively, and N is the set of frame indexes. Each X_n and Y_n takes its values in R^d, where d is the dimension of the acoustic feature vector. We will denote by z = (x, y) = {(x_n, y_n)}_{n∈N} a realization of this process, in which x_n and y_n are the acoustic feature vectors at frame n for the source and the target, respectively. We assume that Z is an independent and identically distributed (i.i.d.) process, such that

p(z) = \prod_{n \in N} p(z_n).

We introduce the auxiliary i.i.d. process of mixture components U = {U_n}_{n∈N}, each U_n taking its values in U with cardinal K. The joint probability distribution of the source and target feature vectors is then modeled by a Gaussian mixture as follows:

p(z_n \mid \phi) = \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(z_n; \phi_k)    (1)
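The joint-density modelling of Eq. (1) can be sketched as follows: a Gaussian mixture is fitted on stacked source-target vectors z_n = (x_n, y_n), yielding the weights α_k and the component parameters φ_k = (μ_k, Σ_k). The feature dimension, the number of components K, and the synthetic data below are illustrative assumptions only.

```python
# Sketch of the joint-density GMM of Eq. (1), fitted on stacked
# source-target vectors z_n = (x_n, y_n). Dimensions, K, and the
# synthetic data are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
d, K, N = 2, 4, 500
x = rng.normal(size=(N, d))                   # source features x_n
y = 0.8 * x + 0.1 * rng.normal(size=(N, d))   # toy target features y_n
z = np.hstack([x, y])                         # joint vectors z_n

gmm = GaussianMixture(n_components=K, covariance_type="full",
                      random_state=0).fit(z)
# alpha_k and phi_k = (mu_k, Sigma_k) of Eq. (1):
alphas, means, covs = gmm.weights_, gmm.means_, gmm.covariances_
```

Fitting on the joint vectors is what lets the model capture the cross-covariances between source and target features, which the conversion function later exploits.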