Robust Romanian Language Automatic Speech Recognizer Based on Multistyle Training

DORU-PETRU MUNTEANU, CONSTANTIN-IULIAN VIZITIU
Communications and Electronic Systems Department
Military Technical Academy
G.Cosbuc 81-83, 050141, Bucharest
ROMANIA
munteanud@mta.ro, vic@mta.ro http://www.mta.ro

Abstract: - This paper presents solutions for increasing the environmental robustness of a previously developed Romanian language continuous speech recognizer. All state-of-the-art automatic speech recognizers (ASR) are data-driven and rely heavily on large amounts of speech data for estimating the model parameters. Most of the available speech corpora used for this training phase contain clean speech recorded in low-noise, reverberation-free environments with high-quality audio equipment. In the real world, however, ASR systems face varied acoustic conditions in which the speech signal is degraded by noise, reverberation, convolutional distortion, etc. The acoustic mismatch between training conditions and testing conditions is the main cause of ASR performance degradation. For instance, the word error rate may be an order of magnitude higher in an office environment than in a clean laboratory environment. Many methods and techniques aim to keep ASR performance at an acceptable level across various acoustic conditions. In this paper we present a strategy called multistyle training for building a robust Romanian language ASR system. The method is based on training the recognizer with degraded speech obtained by adding various levels of artificial noise to clean speech. The experimental results presented show that this scheme strongly increases the system's robustness to additive noise. The system architecture, based on context-dependent HMM phonemes, is also described in detail.

Key-Words: - continuous speech recognition, environmental robustness, multistyle training, context dependent models, hidden Markov models

1 Introduction
Automatic speech recognition is still a subject of scientific research worldwide because it can offer cheap solutions for man-machine interaction. Recognition performance has improved every year over the last decades. A major challenge that both commercial and research ASRs have to address is recognition robustness. Various environmental factors degrade the speech signal between the time it leaves the mouth and the time it is converted to digital form.

Most speech corpora contain clean speech recorded in low-noise, reverberation-free conditions [7], [8]. The performance of speech recognition systems trained with clean speech is known to degrade significantly in real-world applications [9] due to several factors that affect the speech signal, such as additive noise (fans, air conditioning, door slams, keyboard or mouse clicks, etc.) or channel distortions (reverberation, microphone frequency response, A/D converter input filter, etc.).

There are two important strategies for increasing system robustness: speech enhancement (e.g., spectral noise subtraction, echo cancellation) and acoustic model-based methods (e.g., adaptation techniques, parallel model combination, multistyle training). The speech recognizer proposed in this paper is based mainly on two environmental robustness methods:
• Cepstral mean normalization (CMN): reduces convolutive channel distortion
• Multistyle training: adapts the models to additive stationary noise

Experimental results show that system robustness is greatly improved over a wide range of signal-to-noise ratios (SNR). Although we have modeled only white Gaussian noise, the method can be applied to any type of additive noise that could corrupt speech in various acoustic environments.

In this paper, the speech recognizer architecture is described first.
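
As an illustration of the two robustness methods named in this introduction, the following is a minimal sketch, not the implementation used for the experiments. It assumes the speech signal is a list of samples and the cepstral features are a list of per-frame coefficient lists. The function names (`add_white_noise`, `cepstral_mean_normalize`, `measured_snr_db`), the SNR levels, the 16 kHz sampling rate and the sine tone standing in for clean speech are all hypothetical choices made for the example.

```python
import math
import random


def add_white_noise(clean, snr_db, rng=None):
    """Return `clean` plus white Gaussian noise scaled to a target SNR in dB."""
    rng = rng if rng is not None else random.Random()
    signal_power = sum(x * x for x in clean) / len(clean)
    # SNR_dB = 10*log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB/10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in clean]


def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean of each cepstral coefficient.

    A fixed channel is a constant additive offset in the cepstral domain,
    so removing the utterance-level mean reduces its effect.
    """
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


def measured_snr_db(clean, noisy):
    """Estimate the SNR actually obtained after mixing (used as a sanity check)."""
    noise = [n - c for c, n in zip(clean, noisy)]
    p_signal = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)


if __name__ == "__main__":
    # Assumption: a 1 s, 440 Hz tone at 16 kHz stands in for a clean utterance.
    clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
    rng = random.Random(0)
    # Assumption: these SNR levels are illustrative, not those of the paper.
    for snr_db in (30, 20, 10, 0):
        noisy = add_white_noise(clean, snr_db, rng)
        print(snr_db, round(measured_snr_db(clean, noisy), 2))
```

In a multistyle setup along these lines, each clean training utterance would be duplicated at several SNR levels and the pooled clean and noisy data used to estimate the model parameters. The same mixing function works for other additive noise types if the Gaussian samples are replaced with a recorded noise segment scaled to the same power.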
The Romanian language ASR uses phoneme-based hidden Markov models (HMMs) with Gaussian distributions. A voice activity detector is also used for real-time recognition in the testing phase. Context-dependent (CD) modeling is then used to train first-order CD HMMs (triphones) in order to increase the ASR performance. This CD modeling is a very

WSEAS TRANSACTIONS on COMPUTER RESEARCH Doru-Petru Munteanu and Constantin-Iulian Vizitiu ISSN: 1991-8755 98 Issue 2, Volume 3, February 2008