Computer Speech and Language (1998) 12, 41–50

N-Best-based unsupervised speaker adaptation for speech recognition

Tomoko Matsui* and Sadaoki Furui†

*NTT Human Interface Laboratories, 1-1 Hikari-no-oka, Yokosuka-shi, Kanagawa, 239 Japan
†Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, 152 Japan

Abstract

This paper proposes an instantaneous speaker adaptation method that uses N-best decoding for continuous mixture-density hidden-Markov-model-based speech-recognition systems. This method is effective even for speakers for whom decoding with speaker-independent (SI) models is error-prone and for whom speaker adaptation techniques are truly needed. In addition, smoothed estimation and utterance verification are introduced into this method. The smoothed estimation is based on the likelihood values for adapted models of word sequences obtained by N-best decoding, and it improves the performance for error-prone speakers; the utterance verification technique reduces the amount of calculation required. Performance evaluation using connected-digit (four-digit strings) recognition experiments performed over actual telephone lines showed a reduction of 36.4% in the error rates of speakers for whom decoding with SI models is error-prone.

© 1998 Academic Press Limited

1. Introduction

In continuous mixture-density hidden Markov model (HMM)-based speech-recognition systems, the performance of speaker-independent (SI) phoneme HMMs is often poor for some speakers. Techniques that adapt the parameters of SI phoneme HMMs to each speaker, and thus improve the performance, are therefore important. These techniques are usually classified as supervised, in which training utterances are used together with their transcriptions, or unsupervised, in which utterances are used without transcriptions. They can also be classified as either off-line or on-line. Instantaneous adaptation is unsupervised and on-line: the recognition utterances themselves are used to estimate the adaptation transformation.
It is especially useful in applications where there is only a very brief interaction between the speaker and the system. This technique must work using only a small amount of data, such as a few words or a single sentence (Furui, 1989; Zavaliagkos, Schwartz & Makhoul, 1995; Sankar, Neumeyer & Weintraub, 1996). In general, unsupervised adaptation techniques use a recognized word sequence, W*, obtained using SI phoneme HMMs. A parameter set, , of the SI phoneme HMMs is