EIGENSPACE-BASED MAXIMUM A POSTERIORI LINEAR REGRESSION FOR RAPID SPEAKER ADAPTATION

Kuan-ting Chen and Hsin-min Wang
Institute of Information Science, Academia Sinica
Taipei, Taiwan, Republic of China
email: {kenneth, whm}@iis.sinica.edu.tw

Thanks to the Institute for Information Industry and the National Science Council of the Republic of China for funding.

ABSTRACT

In this paper, we present an eigenspace-based approach to prior density selection for the MAPLR framework. The proposed eigenspace-based MAPLR approach introduces an a priori knowledge analysis of the training speakers via probabilistic principal component analysis (PPCA), so as to construct an eigenspace of speaker-specific full regression matrices and to derive a set of bases called eigen-matrices. The priors of the MAPLR transformations for each outside speaker are then chosen in the space spanned by the first K eigen-matrices. By incorporating the PPCA model into the MAPLR scheme, the number of free parameters involved in choosing the priors can be effectively reduced, while the underlying structure of the acoustic space and the precise modeling of the inter-dimensional correlation among the model parameters are well preserved. Both supervised and unsupervised adaptation experiments showed that the proposed approach significantly outperformed the conventional MLLR approach using either diagonal or full regression matrices.

1. INTRODUCTION

Various speaker adaptation techniques have been extensively studied in recent years to tackle the problem of speaker mismatch between the training and testing conditions of speech recognition systems. According to [1], the popular model-based adaptation techniques can be classified into three families: the maximum a posteriori (MAP) adaptation family, the transformation-based adaptation family, including maximum likelihood linear regression (MLLR) [2], and a family related to speaker clustering methods, such as the eigenvoice approach [3]. In this paper, we focus on the adaptation of the mean parameters of the Gaussian mixture components in continuous density HMMs.

Among these techniques, the MLLR approach has been widely used for rapid adaptation and unsupervised adaptation. In MLLR, the speaker independent (SI) mean parameters are adjusted by one or more shared linear transformations. The transformation parameter tying mechanism, based on the design of a regression class tree, can adjust the level of regression matrix sharing according to the amount and content of the adaptation data and can thus effectively improve the robustness of parameter estimation against the sparse data problem.

The conventional MLLR approach nevertheless has several drawbacks. In MLLR for mean adaptation, it is known that full regression matrices model the inter-dimensional correlation among the mean parameters more precisely and thus provide a superior description of speaker characteristics compared with diagonal regression matrices [2]. However, the large number of parameters makes robust estimation of full regression matrices very difficult, especially when the amount of adaptation data is strictly limited. This problem can be alleviated by specifying a prior distribution for each of the regression matrices and estimating the transformations in the maximum a posteriori sense, which leads to the maximum a posteriori linear regression (MAPLR) formulation [4]. Provided that good priors are chosen, the estimation of the regression matrices can be made more robust.
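As a brief point of reference for the discussion above (the notation here is ours and only sketches the standard formulations, not the exact equations of [2] or [4]), MLLR re-estimates each Gaussian mean through a shared affine transformation, and MAPLR regularizes the estimation of that transformation with a prior:
\[
\hat{\mu}_m = A\,\mu_m + b = W\,\xi_m, \qquad \xi_m = [\,1,\; \mu_m^{\top}\,]^{\top},
\]
\[
W_{\mathrm{MLLR}} = \arg\max_{W}\; p(O \mid \lambda, W), \qquad
W_{\mathrm{MAPLR}} = \arg\max_{W}\; p(O \mid \lambda, W)\, p(W),
\]
where $O$ denotes the adaptation data, $\lambda$ the remaining HMM parameters, $\mu_m$ the SI mean of mixture component $m$, and $W = [\,b \;\; A\,]$ the extended regression matrix shared by all components in a regression class. The quality of the prior $p(W)$ is precisely what the eigenspace-based approach developed in this paper addresses.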
On the other hand, it is believed that a priori knowledge about the inter-speaker variation can be extracted by analyzing the training corpus. The eigenvoice approach [3], introduced for fast speaker adaptation, is one example that realizes this concept. The eigenvoice technique finds the new speaker model as a linear combination of a set of canonical speaker models called eigenvoices. These eigenvoices, which characterize the a priori information of the training speakers, are constructed by performing principal component analysis (PCA) [10] on a set of speaker dependent (SD) model parameters. Recently, the eigenvoice approach was further extended via the PPCA model [6] and incorporated into the Bayesian adaptation framework [7].

In the transformation-based adaptation methods, each set of transformations for a specific speaker represents a mapping from the SI models to the SD models of that speaker and can therefore be regarded as a quantitative description of the speaker characteristics. It is clear that a priori information can also be obtained by analyzing the transformations estimated for the training speakers. In our previous work [8], the eigenspace of the speaker-specific full regression matrices was used to improve the conventional full-matrix MLLR when the amount of adaptation data was strictly limited. To alleviate the problem of performance saturation as the amount of adaptation data increased, the eigenspace-based transformations were used as prior information in a smoothing procedure applied to the conventional MLLR transformations, rather than being used directly for adaptation. In this paper, we further extend this idea to the formulation of eigenspace-based MAPLR estimation and propose a framework for choosing the priors by employing the PPCA model; a schematic form of the underlying eigenspace decomposition is sketched at the end of this section.

The rest of this paper is organized as follows. The eigenspace-based transformations and the PPCA model are introduced in Sections 2 and 3, respectively. The MAPLR framework is presented in Section 4. Experimental results for both supervised and unsupervised adaptation on a continuous Mandarin Chinese telephone speech database are discussed in Section 5, and concluding remarks are made in Section 6.
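To make the eigenspace idea reviewed above concrete, the following is a minimal sketch in our own notation (the precise PPCA treatment is deferred to Sections 2 and 3): each training speaker's full regression matrix is treated as a point in a high-dimensional space, the eigen-matrices are derived from these points, and the transformation, or its prior, for a new speaker is constrained to the span of the first K eigen-matrices:
\[
W^{(s)} \;\approx\; \bar{W} \;+\; \sum_{k=1}^{K} w_k^{(s)}\, E_k,
\]
where $\bar{W}$ is the mean of the training speakers' regression matrices, $E_1,\dots,E_K$ are the eigen-matrices, and the weights $w_k^{(s)}$ are specific to speaker $s$.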