A. Elmoataz et al. (Eds.): ICISP 2012, LNCS 7340, pp. 571–578, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Robust Arabic Multi-stream Speech Recognition System
in Noisy Environment
Anissa Imen Amrous and Mohamed Debyeche
Speech Communication and Signal Processing Laboratory (LPCTS),
Faculty of Electronics and Computer Sciences, USTHB
P.O. Box 32, Bab Ezzouar, Algiers, Algeria
amrous_im@hotmail.fr, mdebyeche@gmail.com
Abstract. In this paper, the framework of multi-stream combination is
explored to improve the noise robustness of automatic speech recognition (ASR)
systems. The main issues in multi-stream systems are which feature
representations to combine and what importance (weight) to give to each one.
Two feature streams are investigated, namely the MFCC features and a
set of complementary features consisting of pitch frequency, energy and
the first three formants. Empirically determined optimal weights are fixed for each stream.
The multi-stream vectors are modeled by Hidden Markov Models (HMMs)
with Gaussian Mixture Model (GMM) state distributions. Our ASR system is
implemented using the HTK toolkit and the ARADIGIT corpus, which is a database of
Arabic spoken words. The results show that, for highly noisy speech,
the proposed multi-stream vectors lead to a significant improvement in
recognition accuracy.
Keywords: Multi-stream speech recognition, HMM, noisy environments.
1 Introduction
Improving the robustness of automatic speech recognition (ASR) in the presence of additive noise
has become an active topic, and a number of techniques have been proposed to improve
word accuracy in noisy environments. The use of multi-stream models is one such
technique [1]. A multi-stream speech recognizer is based on the combination of
multiple feature streams, each containing complementary information. The
performance of such a system depends on the features selected for the different
streams not being distorted in the same way by noise. The weight given
to each stream is another important aspect of a multi-stream combination system.
As a rule, reliable streams should receive more weight than streams
corrupted by noise [2], [3], [4].
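The weighting rule can be illustrated with a minimal sketch. It assumes the commonly used exponent-weighting scheme, in which the GMM likelihood of each stream is raised to its stream weight, so that the per-stream log-likelihoods are summed with those weights. The function names, the toy GMM parameters and the weight values (0.8 and 0.2) are hypothetical choices for illustration and are not taken from this paper.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, mixture):
    """Log-likelihood of x under a GMM given as (weight, mean, var) triples."""
    logs = [math.log(w) + diag_gaussian_logpdf(x, m, v) for w, m, v in mixture]
    top = max(logs)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))

def multistream_loglik(streams, state_gmms, weights):
    """Weighted sum of per-stream GMM log-likelihoods for one HMM state."""
    return sum(
        w * gmm_loglik(x, gmm)
        for x, gmm, w in zip(streams, state_gmms, weights)
    )

# Toy, purely illustrative parameters (assumed, not from the paper).
mfcc_gmm = [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [2.0, 2.0], [1.0, 1.0])]
aux_gmm = [(1.0, [1.0], [0.5])]   # stands in for pitch/energy/formant stream
frame = ([0.1, -0.2], [0.9])      # one observation per stream

# A stream judged more reliable gets the larger weight.
score = multistream_loglik(frame, [mfcc_gmm, aux_gmm], weights=[0.8, 0.2])
```

Setting a stream weight to zero removes that stream from the state score, while equal weights of one reduce to the product of the stream likelihoods.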
Many works have tried to improve the robustness of ASR systems by
using several streams of features that rely on different underlying assumptions and
exhibit different properties. Shimmer and jitter are used in [5], and formant and
auditory-based acoustic cues are used together with MFCC in [6], [7]. In [8], [9], a
multi-stream approach is used to combine MFCC features with formant estimates and