Eurospeech 2001 - Scandinavia

Analysis of Speaker Variability

Chao Huang, Tao Chen, Stan Li, Eric Chang and Jianlai Zhou
Microsoft Research, China, 5F, Beijing Sigma Center, No. 49, Zhichun Road, Haidian District, Beijing 100080, P.R.C.
Department of Automation, Tsinghua University
chaoh@microsoft.com

Abstract

Analysis and modeling of speaker variability, such as gender, accent, age, speech rate, and phone realizations, are important issues in speech recognition. Existing feature representations describing speaker variation can be of very high dimension. In this paper, we introduce two powerful multivariate statistical analysis methods, principal component analysis (PCA) and independent component analysis (ICA), as tools for analyzing such variability and extracting low-dimensional feature representations. Our findings are the following: (1) the first two principal components correspond to gender and accent, respectively; to the best of our knowledge, the correspondence of the second component to accent has not been reported before. (2) ICA-based features yield better classification performance than PCA-based ones. Using a 2-dimensional ICA representation, we achieve error rates of about 6.1% in gender classification and 13.3% in accent classification for 980 speakers.

1. Introduction

Speaker variability, such as gender, accent, age, speech rate, and phone realizations, is one of the main difficulties in speech recognition. How these factors correlate with each other and which are the key factors in speech realization are central concerns in speech research. As we know, the performance of speaker-independent (SI) recognition systems is generally 2~3 times worse than that of speaker-dependent ones. As an alternative, different adaptation techniques, such as MAP and MLLR, have been used.
The basic idea is to adjust the SI model so that it reflects the intrinsic characteristics of a specific speaker by re-training the system on appropriate corpora. Another way to deal with speaker variability is to build multiple models of smaller variance, such as gender-dependent and accent-dependent models, and then use a proper model selection scheme for adaptation. Both SI systems and speaker adaptation can be facilitated if the principal variances can be modeled and corresponding compensations made. Another difficulty in speech recognition is the complexity of speech models. A set of models can have a huge number of free parameters. In other words, a representation of a speaker has to be high-dimensional when different phones are taken into account. How to analyze such data is a challenge. Fortunately, several powerful tools, such as principal component analysis (PCA) [2] and, more recently, independent component analysis (ICA) [1], are available for high-dimensional multivariate statistical analysis. They have been applied widely and successfully in many research fields, such as pattern recognition, learning, and image analysis. Recent years have seen some applications in speech analysis [5] [6] [7]. PCA decorrelates the second-order moments of the data and extracts orthogonal principal components of variation. ICA is a linear, not necessarily orthogonal, transform that makes unknown linear mixtures of multidimensional random variables as statistically independent as possible. It not only decorrelates the second-order statistics but also reduces higher-order statistical dependencies. ICA extracts independent components even if their magnitudes are small, whereas PCA extracts the components having the largest magnitudes.
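The contrast between PCA and ICA described above can be illustrated on a toy mixing problem. The following sketch (not part of the paper; it uses scikit-learn and synthetic signals chosen purely for illustration) mixes two independent non-Gaussian sources with an unknown matrix: PCA finds orthogonal directions of maximum variance, while ICA additionally removes higher-order dependencies and recovers the sources up to scale and order.

```python
# Illustrative sketch: PCA vs. ICA on a synthetic two-source mixture.
# All signals and the mixing matrix are invented for this example.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))       # square wave (sub-Gaussian source)
s2 = rng.laplace(size=t.size)     # Laplacian noise (super-Gaussian source)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],
              [0.7, 1.0]])        # "unknown" mixing matrix
X = S @ A.T                       # observed linear mixtures

# PCA: decorrelates second-order statistics, orthogonal components
Y_pca = PCA(n_components=2).fit_transform(X)

# ICA: also reduces higher-order dependencies, recovering the
# independent sources (up to permutation and scaling)
Y_ica = FastICA(n_components=2, random_state=0).fit_transform(X)

# Each ICA component should align strongly with one original source
corr = np.abs(np.corrcoef(np.c_[S, Y_ica].T)[:2, 2:])
```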
The ICA representation seems to capture the essential structure of the data in many applications, including feature extraction and signal separation. In this paper, we present a subspace analysis method for the analysis of speaker variability and for the extraction of low-dimensional speech features. The transformation matrix obtained by maximum likelihood linear regression (MLLR) is adopted as the original representation of the speaker characteristics. Each speaker is represented by a super-vector comprising different regression classes (65 classes at most), with each class being a vector. Important components in a low-dimensional space are extracted by PCA or ICA. We find that the first two principal components clearly represent gender and accent, respectively. That the second component corresponds to accent has never been reported before, while it has been shown that the first component corresponds to gender [6] [7]. Furthermore, ICA features improve classification performance over PCA ones. Using the ICA representation and a simple threshold method, we achieve a gender classification accuracy of 93.9% and an accent classification accuracy of 86.7% on a data set of 980 speakers. The paper is organized as follows. In section 2, we highlight the basic ideas of PCA and ICA and some related work; the original and efficient speaker representations are also discussed there. Detailed experimental setups and result analysis are given in section 3. Section 4 concludes with our findings and a discussion of possible applications.

2. Speaker Variance Investigations

2.1. Related work

PCA and ICA have been widely used in image processing, especially in face recognition, identification and tracking.
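The pipeline sketched above (high-dimensional speaker vectors projected onto a few components, then a simple threshold on one component) can be outlined in code. The sketch below is purely illustrative: the speaker super-vectors are random stand-ins for the MLLR-derived representations, the dimensions and the synthetic gender effect are invented, and the threshold rule is a minimal version of the classification scheme, not the paper's actual setup.

```python
# Illustrative sketch of the analysis pipeline: project speaker
# super-vectors onto principal components and classify gender by
# thresholding one component. Data here are synthetic stand-ins.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
n_speakers, dim = 200, 500               # hypothetical sizes
gender = rng.randint(0, 2, n_speakers)   # 0 = female, 1 = male

# Synthetic super-vectors: gender shifts the mean along one direction
u = rng.randn(dim)
u /= np.linalg.norm(u)
shift = np.where(gender == 1, 4.0, -4.0)
X = rng.randn(n_speakers, dim) + shift[:, None] * u

# Project onto the first two principal components
Z = PCA(n_components=2).fit_transform(X)

# Simple threshold on the first component (sign is arbitrary, so
# take whichever labeling of the two sides fits the data better)
pred = (Z[:, 0] > 0).astype(int)
acc = max(np.mean(pred == gender), np.mean(pred != gender))
```

Because the synthetic gender effect dominates the variance, the first principal component aligns with it and the threshold separates the two groups almost perfectly; in real data the separation is noisier, as the paper's 93.9% gender accuracy reflects.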