Robust Distributed Speech Recognition Using Histogram Equalization and Correlation Information

Pedro M. Martinez, Jose C. Segura, Luz Garcia
Department of Signal Theory, Networking and Communications
University of Granada, Spain
pmmartinez@auna.com, segura@ugr.es, luzgm@ugr.es

Abstract

In this paper, we propose a noise compensation method for robust speech recognition in DSR (Distributed Speech Recognition) systems based on histogram equalization and correlation information. The objective of this method is to exploit the correlation between the components of the feature vector and the temporal correlation between consecutive frames of each component. Recognition experiments on the Aurora 2, Aurora 3-Spanish and Aurora 3-Italian databases demonstrate that the use of this correlation information increases recognition accuracy.

Index Terms: Distributed Speech Recognition, noise compensation, histogram equalization, correlation information

1. Introduction

Voice communication systems are progressively moving from the analog world toward the digital world. Cellular phones and voice over IP (VoIP) services rely on this technology, in which the analog voice signal is digitized before transmission. This digital processing allows the implementation of increasingly complex functions that meet new needs, such as automatic speech recognition (ASR). ASR can be very useful for tasks that have traditionally been accomplished via buttons, but it also opens the door to new services.

In practice, implementing a complete ASR system on every client's terminal can be unviable. The devices would need enough storage and processing power to perform the whole ASR process, and this is not always possible. Distributed Speech Recognition (DSR) solves this problem by distributing the ASR system between the client and the server.
In this client-server architecture, feature extraction is performed locally at the client; the features are then compressed and transmitted to a remote server, where the recognition system is implemented.

The speech features used are based on the Mel Frequency Cepstral Coefficients (MFCC) [1], which are the most commonly used parameters in currently available speech recognition systems. They achieve very high accuracy in clean speech environments, but performance degrades quickly when the voice signal is affected by additive noise. This is because speech recognition systems are generally trained with speech acquired under clean conditions, which does not accurately model speech acquired under noisy conditions. Additive noise causes a nonlinear distortion of the coefficient value space, and a compensation method is needed to minimize this effect. In [2] and [3], MFCCs are compressed using linear prediction, and in [4], [5] and [6] the DCT and 2-D DCT are used. Histogram equalization (HEQ) has been studied in [7] and [8] in order to improve the robustness of speech recognition systems. Other approaches have also been proposed (see for example [9]) that differ in the domain in which HEQ is applied. In [10], the authors show that interframe correlation information is very useful for improving recognition.

In this paper, we propose a noise compensation method based on histogram equalization. The equalization is based on the hypothesis that, when the local values of a coefficient are sorted, the position of the current frame in that ordering does not change significantly when the speech signal is affected by additive noise. In other words, although noise changes every individual coefficient value, the local order statistics remain similar. This can be represented as a histogram, or a cumulative distribution function.
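The order-statistics idea above amounts to quantile mapping: the noisy value is located in the empirical distribution of its local neighborhood and replaced by the value at the same quantile in a clean reference distribution. The following is a minimal sketch of this scheme, not the exact algorithm of this paper; the function name, the fixed-size window standing in for the N local frames, and the sorted clean reference array are all illustrative assumptions.

```python
import numpy as np

def equalize_frame(noisy_coeff, clean_ref_sorted, t, N=101):
    """Estimate the clean value of one cepstral coefficient at frame t
    by rank matching: x_hat = F1_inv(F2(x_t)), where F2 is the empirical
    CDF of the N frames around t and F1_inv is approximated by a sorted
    array of clean reference values.  Illustrative sketch only."""
    half = N // 2
    lo, hi = max(0, t - half), min(len(noisy_coeff), t + half + 1)
    window = np.sort(noisy_coeff[lo:hi])
    # F2(x_t): fraction of local values not exceeding the current one.
    q = np.searchsorted(window, noisy_coeff[t], side="right") / len(window)
    # F1^{-1}(q): clean reference value at the same quantile.
    idx = min(int(q * len(clean_ref_sorted)), len(clean_ref_sorted) - 1)
    return clean_ref_sorted[idx]
```

Because only the rank of the current value within its neighborhood is used, any monotone distortion of the coefficient trajectory leaves the estimate unchanged, which is exactly the robustness the order-statistics hypothesis relies on.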
Moreover, in order to exploit the existing correlation between coefficients, it is natural to use a histogram-based vector quantization that quantizes each pair of MFCC parameters together, as proposed in [8]. Additionally, MFCC values carry another kind of implicit information that can be used to improve the quantization: the temporal correlation, or interframe correlation, between the values of each coefficient. Exploiting it is the main contribution of this paper. In a first step, we show that this information by itself improves the quantization, since it increases the recognition accuracy. In a second step, we propose a method that uses both correlations (the temporal correlation and the correlation between coefficients) in order to improve the recognition performance as much as possible, using all the available information.

The layout of this paper is as follows: Section 2 describes the quantization method in detail; Section 3 presents and discusses the results obtained with the Aurora 2, Aurora 3-Spanish and Aurora 3-Italian databases; finally, Section 4 presents the conclusions.

2. Description of the Method

As noted above, the proposed equalization is based on the hypothesis that, for each coefficient, the position of a frame within a sorted list of the values of its local frames is not significantly changed by the presence of noise. Graphically, this can be shown as a histogram or a cumulative distribution function built by sorting the values of the N frames around the current one. An example is shown in Figure 1, where we assume that we have N frames from a clean utterance, and the same frames after some noise has been added to the voice signal. The curve F1(x) represents the cumulative distribution function of the N values from the clean utterance, and F2(x) corresponds to the noisy utterance. According to this, if we can only observe the noisy values and the current frame has the value x2, the best estimation we can