Model-Based Speaker Normalization Methods for Speech Recognition

Masaki Naito,1 Li Deng,2 and Yoshinori Sagisaka1

1 ATR Interpreting Telecommunications Research Laboratories, Kyoto, 619-0237 Japan
2 Department of Electrical and Computer Engineering, Waterloo University, Waterloo, Ontario, N2L 3G1 Canada

SUMMARY

A speaker normalization method using a speech generation model is proposed in order to achieve high-performance speaker adaptation with a small amount of adaptation data. The speaker- and phoneme-dependent vocal-tract area function is approximated by the corresponding area function produced by the articulatory model of a standard speaker, combined with phoneme-independent feature quantities of the vocal-tract shape of the normalized target speaker as estimated from the formant frequencies of two vowels. The frequency warping functions are determined from the formant frequencies of speech calculated from the vocal-tract area functions thus obtained, and the uttered speech is normalized by stretching the speech spectrum along the frequency axis. Continuous phoneme recognition experiments using phoneme connection rules show that the proposed method reduces the recognition error of a gender-dependent model by about 30% and that it achieves recognition performance superior to that of vocal-tract length normalization. The recognition performance of the proposed method is also equivalent to that of speaker adaptation by moving vector field smoothing (VFS) using 10 phonetically balanced sentences, showing that the proposed method can achieve high-performance speaker adaptation with a small amount of adaptation data. © 2003 Wiley Periodicals, Inc. Electron Comm Jpn Pt 2, 86(2): 45–56, 2003; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecjb.10119

Key words: vocal tract shape; articulatory model; vocal-tract area functions; frequency warping; speaker normalization.

Electronics and Communications in Japan, Part 2, Vol. 86, No. 2, 2003. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J83-D-II, No. 11, November 2000, pp. 2360–2369.

1. Introduction

Speaker adaptation methods that take account of the acoustic features of speech have been proposed in the past. However, when only a small amount of speech data is used for adaptation, the data provide little information on the uttered speech, and as a result, little improvement in recognition performance is achieved. Recently, speaker normalization methods that take account of the vocal-tract length of the speaker have been proposed as a means of producing acoustical models that exclude speaker characteristics [1, 2]. The vocal-tract shape of the speaker is important in determining the acoustical features of the speaker's utterances. It may therefore be possible to compensate for an inadequate amount of adaptation data by using a small number of parameters characterizing the vocal-tract shape of the individual speaker, such as vocal-tract length, together with knowledge of the shape and movement of the articulatory organs.

We therefore propose in this paper a speaker normalization method that uses acoustic features of speech that are estimated from the vocal-tract shape of the speaker. In this