Effects of Speaker Adaptive Training on Tensor-based Arbitrary Speaker Conversion

Daisuke Saito¹, Nobuaki Minematsu², Keikichi Hirose¹
¹ Graduate School of Information Science and Technology, The University of Tokyo, Japan
² Graduate School of Engineering, The University of Tokyo, Japan
dsaito@hil.t.u-tokyo.ac.jp, {mine,hirose}@gavo.t.u-tokyo.ac.jp

Abstract

This paper introduces speaker adaptive training techniques to tensor-based arbitrary speaker conversion. In voice conversion studies, realizing conversion from/to an arbitrary speaker's voice is one of the important objectives. For this purpose, eigenvoice conversion (EVC), which is based on an eigenvoice Gaussian mixture model (EV-GMM), was proposed. Although EVC can effectively construct the conversion model for arbitrary target speakers using only a few utterances, increasing the number of utterances used to construct the conversion model does not always improve the conversion performance. This is because the EV-GMM method has an inherent problem in its representation of GMM supervectors. We previously proposed a tensor-based speaker space as a solution to this problem, and realized more flexible control of speaker characteristics. In this paper, aiming at a larger improvement in VC performance, speaker adaptive training and tensor-based speaker representation are integrated. The proposed method can construct a flexible and precise conversion model, and experimental results of one-to-many voice conversion demonstrate the effectiveness of the proposed approach.

Index Terms: voice conversion, Gaussian mixture model, eigenvoice, Tucker decomposition, speaker adaptive training

1. Introduction

Voice conversion (VC), or speaker conversion, is a technique to transform an input utterance of one speaker into an utterance that sounds like another speaker while preserving its linguistic content [1]. VC techniques can be applied to various applications besides speech synthesis [2, 3].
Among several statistical approaches to constructing the conversion model, GMM-based approaches are widely used because of their flexibility [2, 4].

When constructing the conversion model, however, a parallel training corpus, that is, a set of utterance pairs of the same sentences spoken by a source and a target speaker, is required. This requirement limits the applicability of the conversion model to the specific speaker pair. Hence, flexible control of speaker characteristics with little need for a parallel corpus is an important objective of VC. For this purpose, several adaptation techniques using voices of other speakers have been proposed [5, 6]. These approaches are inspired by speaker adaptation techniques in speech recognition studies. Among these, eigenvoice conversion (EVC) [6], which uses the eigenvoice technique proposed in speech recognition [7], is implemented by constructing a speaker space. Based on training with multiple pre-stored parallel data sets, a speaker space is constructed using GMM supervectors, in a similar manner to speaker recognition studies [8]. Adaptation to an arbitrary speaker then becomes the problem of locating that speaker in the constructed speaker space. Hence, precise construction of the speaker space is important for improving the performance of voice conversion. However, the GMM supervector representation has an inherent problem: multiple factors of acoustic variation are included in the same space. This problem limits how well the adaptation performance of EVC scales.

We have recently proposed a new representation of speaker space based on tensor analysis for arbitrary speaker conversion [9]. In our approach, an arbitrary speaker is represented not as a supervector but as a matrix whose rows and columns correspond to the components of the GMM and the dimensions of the mean vector, respectively.
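The difference between the two speaker representations can be sketched in a few lines. The following is a minimal illustration, assuming that a supervector is formed by concatenating the component mean vectors in component order; the function names and the toy values (M = 2 components, D = 3 dimensions) are hypothetical and are not taken from [9].

```python
def matrix_to_supervector(mean_matrix):
    """Concatenate the rows (component means) of an M x D matrix
    into one flat supervector of length M * D.
    Assumption: means are concatenated in component order."""
    return [value for row in mean_matrix for value in row]


def supervector_to_matrix(supervector, num_components, dim):
    """Reshape a length M * D supervector into an M x D matrix whose
    m-th row is the mean vector of the m-th GMM component."""
    if len(supervector) != num_components * dim:
        raise ValueError("supervector length must equal num_components * dim")
    return [
        list(supervector[m * dim:(m + 1) * dim])
        for m in range(num_components)
    ]


# Hypothetical toy speaker: M = 2 mixture components, D = 3 dimensions.
speaker_means = [
    [1.0, 2.0, 3.0],  # mean vector of component 1
    [4.0, 5.0, 6.0],  # mean vector of component 2
]

supervector = matrix_to_supervector(speaker_means)
restored = supervector_to_matrix(supervector, num_components=2, dim=3)
```

Both forms hold the same numbers. The matrix form keeps the component axis and the feature-dimension axis separate, whereas the supervector merges them into a single axis.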
Using this representation, we can express the data set of the pre-stored speakers as a third-order tensor and apply tensor analysis to obtain the speaker space. Based on this speaker space, Tensor-based Arbitrary Speaker Conversion (TASC) was realized, and its effectiveness compared with EVC was shown on a one-to-many VC task [9].

Because our approach is a new method of representing a speaker space, it can be flexibly integrated with other effective techniques that are independent of the speaker space. In this paper, we introduce speaker adaptive training for TASC. Speaker adaptive training was introduced for training a canonical speaker-independent model [10], and its effectiveness in arbitrary speaker conversion was shown in [11]. This paper investigates the effects of speaker adaptive training when it is applied to the tensor-based flexible speaker representation.

2. Eigenvoice conversion (EVC)

2.1. Eigenvoice GMM (EV-GMM)

In this section, one-to-many EVC [6] is briefly described. Let $X_t = [x_t^\top, \Delta x_t^\top]^\top$ and $Y_t^{(s)} = [y_t^{(s)\top}, \Delta y_t^{(s)\top}]^\top$ be $2D$-dimensional feature vectors of the source speaker and the $s$-th target speaker, respectively. They consist of $D$-dimensional static and dynamic features. The notation $(\cdot)^\top$ denotes transposition of a vector. The joint probability density of the source and the target vectors is modeled by an EV-GMM as follows:

$$
P\left(X_t, Y_t^{(s)} \mid \lambda^{(EV)}, w^{(s)}\right) = \sum_{m=1}^{M} \alpha_m \, \mathcal{N}\left([X_t^\top, Y_t^{(s)\top}]^\top; \mu_m^{(Z)}(w^{(s)}), \Sigma_m^{(Z)}\right), \tag{1}
$$

$$
\mu_m^{(Z)}(w^{(s)}) = \begin{bmatrix} \mu_m^{(X)} \\ B_m w^{(s)} + b_m^{(0)} \end{bmatrix}, \quad
\Sigma_m^{(Z)} = \begin{bmatrix} \Sigma_m^{(XX)} & \Sigma_m^{(XY)} \\ \Sigma_m^{(YX)} & \Sigma_m^{(YY)} \end{bmatrix}, \tag{2}
$$

where $\mathcal{N}(x; \mu, \Sigma)$ denotes the normal distribution with a mean vector $\mu$ and a covariance matrix $\Sigma$. The weight of the $m$-th component is denoted as $\alpha_m$, and the number of mixture

INTERSPEECH 2012, ISCA's 13th Annual Conference, Portland, OR, USA, September 9-13, 2012. ISCA Archive: http://www.isca-speech.org/archive. doi: 10.21437/Interspeech.2012-35