AUTOMATIC ACCENT IDENTIFICATION USING GAUSSIAN MIXTURE MODELS

Tao Chen*+, Chao Huang, Eric Chang and Jingchun Wang*

Microsoft Research China, 5F, Sigma Center, No. 49, Zhichun Road, Beijing 100080, P.R.C.
* Department of Automation, Tsinghua University
{chentao,wang-jc}@proc.au.tsinghua.edu.cn; {chaoh, echang}@microsoft.com
+ Work carried out as a visiting student at MSR China.

ABSTRACT

It is well known that speaker variability caused by accent is an important factor in speech recognition. Some major accents in China differ so much that this problem becomes very severe. In this paper, we propose a Gaussian mixture model (GMM) based method for Mandarin accent identification. In this method, a number of GMMs are trained to identify the most likely accent given test utterances. The identified accent type can then be used to select an accent-dependent model for speech recognition. A multi-accent Mandarin corpus was developed for the task, covering 4 typical accents in China with 1,440 speakers (1,200 for training, 240 for testing). We explore experimentally the effect of the number of GMM components on identification performance. We also investigate how many utterances per speaker are sufficient to recognize his/her accent reliably. Finally, we show the correlations among accents and discuss them.

1. INTRODUCTION

Speaker variability, such as gender, accent, age, speaking rate, and phone realizations, is one of the main difficulties in speech recognition tasks. It is shown in [1] that gender and accent are the two most important factors in speaker variability. Usually, gender-dependent models are used to deal with the gender variability problem.

In China, almost every province has its own dialect. When a speaker speaks Mandarin, his/her dialect greatly affects the accent. Some typical accents, such as Beijing, Shanghai, Guangdong and Taiwan, are quite different from each other in acoustic characteristics.
As with gender variability, a simple way to deal with the accent problem is to build multiple models, each covering a smaller accent variance, and then use a model selector for the adaptation. Cross-accent experiments [2] show that the performance of accent-independent systems is generally 30% worse than that of accent-dependent ones. It is therefore worthwhile to develop an accent identification method with an acceptable error rate.

Current accent identification research focuses on the foreign accent problem, that is, identifying non-native accents. Teixeira et al. [3] proposed a Hidden Markov Model (HMM) based system to identify English spoken with 6 foreign accents. A context-independent HMM was used since the corpus consisted mostly of isolated words, which is not always the case in applications. Hansen and Arslan [4] also built HMMs to classify foreign accents of American English. They analyzed the impact of some prosodic features on classification performance and concluded that carefully selected prosodic features would improve classification accuracy. Instead of phoneme-based HMMs, Fung and Liu [5] used phoneme-class HMMs to differentiate Cantonese English from native English. Berkling et al. [6] added knowledge of English syllable structure to help recognize 3 accented speaker groups of Australian English.

Although foreign accent identification has been explored extensively, little has been done on domestic accent identification, to the best of our knowledge. In fact, domestic accent identification is more challenging: 1) some linguistic knowledge, such as the syllable structure used in [6], is of little use, since people seldom make such mistakes in their mother tongue; 2) the differences among domestic speakers are smaller than those among foreign speakers. In our work, we want to identify different accent types spoken by people with the same mother tongue.

Most current accent identification systems, as mentioned above, are built on the HMM framework.
Although HMMs are effective in classifying accents, their training procedure is time-consuming. Using an HMM to model every phoneme or phoneme class is also computationally expensive. Furthermore, HMM training is supervised: it needs phone transcriptions. The transcriptions are either labeled manually or obtained from a speaker-independent model, whose word error rate will degrade the identification performance.

In this paper, we propose a GMM based method for identifying the accents of domestic speakers. Four typical Mandarin accent types are explored. Since phoneme and phoneme-class information is outside our concern, we model only the accent characteristics of the speech signal. GMM training is unsupervised: no transcriptions are needed. We train two GMMs for each accent, one for male speakers and one for female speakers, since gender is the largest source of speaker variability. Given test utterances, the speaker's gender and accent can be identified at the same time, in contrast to the two-stage method in [3]. The relationship between GMM parameter and recognition accuracy