Model-Based Speaker Normalization Methods for Speech Recognition
Masaki Naito,1 Li Deng,2 and Yoshinori Sagisaka1
1 ATR Interpreting Telecommunications Research Laboratories, Kyoto, 619-0237 Japan
2 Department of Electrical and Computer Engineering, Waterloo University, Waterloo, Ontario, N2L 3G1 Canada
SUMMARY
A speaker normalization method using a speech generation
model is proposed in order to achieve high-performance
speaker adaptation with a small amount of adaptation data.
The speaker- and phoneme-dependent vocal-tract area function
is approximated by the corresponding area function produced
by the articulatory model of a standard speaker, combined
with phoneme-independent parameters of the vocal-tract shape
of the speaker to be normalized, which are estimated from the
formant frequencies of two vowels. Frequency warping
functions are determined from the formant frequencies
calculated from the vocal-tract area functions thus obtained,
and the uttered speech is normalized by stretching its
spectrum along the frequency axis. Continuous phoneme
recognition experiments using phoneme connection rules show
that the proposed method reduces the recognition error of a
gender-dependent model by about 30% and that it outperforms
vocal-tract length normalization. The recognition performance
of the proposed method is also equivalent to that of speaker
adaptation by moving vector field smoothing (VFS) using 10
phonetically balanced sentences, showing that the proposed
method achieves high-performance speaker adaptation with a
small amount of adaptation data. © 2003 Wiley Periodicals,
Inc. Electron Comm Jpn Pt 2, 86(2): 45–56, 2003; Published
online in Wiley InterScience (www.interscience.wiley.com).
DOI 10.1002/ecjb.10119
Key words: vocal tract shape; articulatory model;
vocal-tract area functions; frequency warping; speaker
normalization.
1. Introduction
Speaker adaptation methods that take account of the
acoustic features of speech have been proposed in the past.
However, when only a small amount of speech data is
available for adaptation, it provides little information
about the speaker's speech, and recognition performance
improves only slightly. Recently, speaker normalization
methods that take account of the vocal-tract length of the
speaker have been proposed as a means of producing acoustic
models that exclude speaker characteristics [1, 2]. The
vocal-tract shape of the speaker is important in determining
the acoustic features of the speaker's utterances, and it
may be possible to compensate for an inadequate amount of
adaptation data by using a small number of parameters that
characterize the vocal-tract shape of the individual
speaker, such as vocal-tract length, together with knowledge
of the shape and movement of the articulatory organs.
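As a rough illustration of why a single parameter such as vocal-tract length carries useful information, the sketch below treats the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1)c / (4L), and applies a linear frequency warp derived from the length ratio. This is only the simple length-based idea, not the articulatory-model method proposed in this paper; the function names, the speed of sound, and the example lengths are assumptions made for illustration.

```python
# Illustrative sketch only: uniform-tube vocal tract and linear frequency warping.
# All names and constants here are assumptions, not taken from the paper.
SPEED_OF_SOUND_CM_S = 35000.0  # approximate speed of sound in the vocal tract (assumed)


def uniform_tube_formants(length_cm, count=3):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
            for n in range(1, count + 1)]


def warp_factor(speaker_length_cm, standard_length_cm):
    """Factor that scales the speaker's frequencies onto the standard speaker's."""
    return speaker_length_cm / standard_length_cm


def warp_spectrum(spectrum, alpha):
    """Stretch a magnitude spectrum along the frequency axis.

    Output bin k takes the linearly interpolated input value at fractional
    bin k / alpha, so a peak at input bin j moves to roughly bin alpha * j.
    """
    last = len(spectrum) - 1
    warped = []
    for k in range(len(spectrum)):
        pos = min(k / alpha, last)  # clamp at the top of the band
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        warped.append((1.0 - frac) * spectrum[lo] + frac * spectrum[hi])
    return warped


if __name__ == "__main__":
    # Hypothetical lengths: a 14 cm speaker normalized to a 17.5 cm standard speaker.
    print(uniform_tube_formants(14.0))
    print(uniform_tube_formants(17.5))
    print(warp_factor(14.0, 17.5))
```

Under these assumptions, a shorter tract yields proportionally higher resonances, so one scale factor aligns all formants at once. A real vocal tract is not uniform, which is the motivation for estimating the warping from phoneme-dependent area functions instead.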
We therefore propose in this paper a speaker normalization
method that uses acoustic features of speech that are
estimated from the vocal-tract shape of the speaker. In this
Electronics and Communications in Japan, Part 2, Vol. 86, No. 2, 2003
Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J83-D-II, No. 11, November 2000, pp. 2360–2369