GENDER RECOGNITION AND GENDER-BASED ACOUSTIC MODEL ADAPTATION FOR TELEPHONE-BASED SPOKEN DIALOG SYSTEM

Kinfe Tadesse Mengistu, Martin Schafföner, Andreas Wendemuth
Cognitive Systems Group, Otto-von-Guericke University
Kinfe.Tadesse@E-Technik.Uni-Magdeburg.de

Abstract: In this paper we describe the speech recognition component of a telephone-based spoken dialog system that uses an HTK-based speech recognizer integrated into a VoiceXML framework with an ISDN telephone interface. Since the speech recognizer is one of the most decisive components determining the usefulness and user acceptance of a dialog system, we present strategies for building and improving the performance of the speech recognition component within such a system. The baseline speaker-independent system yields a word error rate (WER) of 13.66% for female speakers and 21.55% for male speakers on 22 hours of telephone speech from the Communicator 2001 Evaluation corpus. The system is thus biased towards female speakers, which we attribute to the fact that the number of female speakers used in training the models is significantly higher than that of male speakers (72 vs. 28). To combat this imbalance and to improve the performance of the system for male speakers, we pursue two approaches. First, exploiting the within-gender acoustic similarity that arises from the similar vocal mechanisms of speakers of the same gender, we adapt the speaker-independent HMMs using adaptation data from each gender. As an alternative, we build separate gender-dependent models. We also built a Gaussian Mixture Model (GMM) gender classifier that determines the gender of a speaker from a very short utterance (typically a "yes" or a "no") with 96.62% accuracy.
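The GMM-based gender classification summarized above can be sketched as follows. This is a minimal illustration, assuming per-gender diagonal-covariance GMMs trained with EM and classification by comparing the total log-likelihood of an utterance's feature frames under each model; the feature dimensionality, number of mixture components, and training data here are synthetic placeholders, not the paper's actual setup.

```python
# Minimal sketch of per-gender GMM training and likelihood-based gender
# classification. All data and model sizes are illustrative assumptions.
import numpy as np

def fit_gmm(X, n_comp=2, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM to X (n_frames x n_dim) via EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, n_comp, replace=False)].copy()
    var = np.tile(X.var(axis=0), (n_comp, 1)) + 1e-6
    weights = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: responsibilities from per-component log densities
        logp = (-0.5 * (((X[:, None, :] - means) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2)
                + np.log(weights))
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and diagonal variances
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        var = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, var

def gmm_loglik(X, params):
    """Total log-likelihood of all frames in X under one GMM."""
    weights, means, var = params
    logp = (-0.5 * (((X[:, None, :] - means) ** 2) / var
                    + np.log(2 * np.pi * var)).sum(axis=2)
            + np.log(weights))
    m = logp.max(axis=1, keepdims=True)
    return (m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))).sum()

def classify_gender(utterance_feats, gmm_female, gmm_male):
    """Pick the gender whose GMM assigns the higher log-likelihood."""
    return ('female' if gmm_loglik(utterance_feats, gmm_female)
            > gmm_loglik(utterance_feats, gmm_male) else 'male')
```

Because the decision sums log-likelihoods over frames, even a very short utterance (a few dozen frames of a "yes" or "no") can accumulate enough evidence for a reliable decision.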
1 Introduction

A telephone-based spoken dialog system comprises a telephone network interface to deliver calls into the system, a speech recognizer to accept requests from users, a text-to-speech (TTS) synthesizer to play prompts and responses to the caller, a semantic interpreter to comprehend requests, a mechanism for response generation, and a dialog manager to orchestrate the various components.

The speech recognizer in our dialog system uses HTK [1] to build recognition resources and its API (ATK) to build a real-time speech recognizer [2] integrated in a VoiceXML framework. Among other features, ATK allows flexible use of resources during the recognition process. It uses a global configuration file in which HTK-compatible HMM models and other recognition resources, such as the grammar, HMM list, and pronunciation lexicon, are specified [2]. This makes it possible to use the same framework for various application domains and languages by simply building the necessary recognition resources offline and specifying them in the configuration file.

The choice of an open VoiceXML platform is an important design decision. We have chosen OptimTalk¹ as it is open enough to allow the integration of our own speech recognizer, telephone interface, etc.

¹ http://www.optimsys.cz/
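To make the configuration-driven design concrete, a resource specification of this kind might look as follows. This is a purely hypothetical sketch: the parameter names and file paths are illustrative assumptions, and the actual keys are defined by the ATK documentation.

```
# Hypothetical HTK/ATK-style global configuration fragment.
# Parameter names and paths are illustrative only.
ARec:  GRAMFILE = resources/travel.net    # recognition grammar
ARec:  MMFFILE  = resources/hmms.mmf      # HTK-compatible HMM set
ARec:  HMMLIST  = resources/tiedlist      # HMM list
ARec:  DICTFILE = resources/lexicon.dict  # pronunciation lexicon
```

Porting the system to a new domain or language would then amount to rebuilding these resources offline and pointing the configuration entries at the new files, leaving the runtime framework unchanged.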