A. Elmoataz et al. (Eds.): ICISP 2012, LNCS 7340, pp. 571–578, 2012. © Springer-Verlag Berlin Heidelberg 2012

Robust Arabic Multi-stream Speech Recognition System in Noisy Environment

Anissa Imen Amrous and Mohamed Debyeche

Speech Communication and Signal Processing Laboratory (LPCTS), Faculty of Electronics and Computer Sciences, USTHB, P.O. Box 32, Bab Ezzouar, Algiers, Algeria
amrous_im@hotmail.fr, mdebyeche@gmail.com

Abstract. In this paper, the multi-stream combination framework is explored to improve the noise robustness of automatic speech recognition (ASR) systems. The main issues in multi-stream systems are which feature representations to combine and what importance (weight) to give to each one. Two feature streams are investigated: the MFCC features and a set of complementary features consisting of pitch frequency, energy, and the first three formants. Empirically optimal weights are fixed for each stream. The multi-stream vectors are modeled by Hidden Markov Models (HMMs) with Gaussian Mixture Model (GMM) state distributions. Our ASR system is implemented using the HTK toolkit and the ARADIGIT corpus, a database of Arabic spoken words. The results show that, for highly noisy speech, the proposed multi-stream vectors lead to a significant improvement in recognition accuracy.

Keywords: Multi-stream speech recognition, HMM, noisy environments.

1 Introduction

Improving the robustness of automatic speech recognition in the presence of additive noise has become an active topic, and a number of techniques have been proposed to improve word accuracy in noisy environments. The use of multi-stream models is one such technique [1]. A multi-stream speech recognizer combines multiple feature streams, each containing complementary information. The performance of such a system depends on the features selected for each stream not undergoing the same distortion in the presence of noise.
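To make the role of per-stream weights concrete, the following minimal Python sketch shows the usual multi-stream formulation, in which the state likelihood is the product of the per-stream likelihoods, each raised to the power of its stream weight (equivalently, a weighted sum of log-likelihoods). It is an illustration under stated assumptions and not the configuration used in this paper: the function names and all numeric values are hypothetical, and each stream is modeled by a single diagonal-covariance Gaussian rather than a full GMM to keep the example short.

```python
import math


def diag_gaussian_log_pdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def multi_stream_log_likelihood(stream_obs, stream_models, stream_weights):
    """Weighted sum of per-stream log-likelihoods for one HMM state.

    In the probability domain this equals the product of the per-stream
    likelihoods, each raised to the power of its stream weight.
    Simplifying assumption: one Gaussian per stream instead of a GMM.
    """
    total = 0.0
    for obs, (mean, var), weight in zip(stream_obs, stream_models, stream_weights):
        total += weight * diag_gaussian_log_pdf(obs, mean, var)
    return total


# Hypothetical toy frame: a 2-dimensional "spectral" stream and a
# 1-dimensional "complementary" stream (values are illustrative only).
obs = [[0.2, -0.1], [1.5]]
models = [([0.0, 0.0], [1.0, 1.0]), ([1.0], [0.5])]
weights = [0.7, 0.3]  # hypothetical stream weights

score = multi_stream_log_likelihood(obs, models, weights)
print(f"weighted state log-likelihood: {score:.4f}")
```

Raising the weight of a stream makes its log-likelihood count for more in the state score, which is how a recognizer can favor the stream that is less affected by noise.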
The weight given to each stream is another important aspect of a multi-stream combination system. The rule is that reliable streams should receive more weight than streams corrupted by noise [2], [3], [4]. Many works have tried to improve the robustness of ASR systems by using several feature streams that rely on different underlying assumptions and exhibit different properties. Shimmer and jitter are used in [5], and formant and auditory-based acoustic cues are used together with MFCC in [6], [7]. In [8], [9], a multi-stream approach is used to combine MFCC features with formant estimates and