SPEECH ENHANCEMENT USING INTRA-FRAME DEPENDENCY IN DCT DOMAIN

Achintya Kundu, Saikat Chatterjee and T.V. Sreenivas
Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore, India 560012.
Phone: +91 80 2360 2167, Fax: +91 80 2360 0683. Email: {achintya, saikat, tvsree}@ece.iisc.ernet.in

ABSTRACT

In this paper, we present a new speech enhancement approach that exploits the intra-frame dependency of discrete cosine transform (DCT) domain coefficients. Existing enhancement techniques treat the transform domain coefficients independently. Instead of processing the scalars independently, we split the DCT domain noisy speech vector into sub-vectors and enhance each sub-vector independently. This sub-vector based approach exploits the advantage of higher dimensional processing, viz. the non-linear dependency among coefficients. In the developed method, each clean speech sub-vector is modeled using a Gaussian mixture (GM) density. We show that the proposed Gaussian mixture model (GMM) based DCT domain method, using the sub-vector processing approach, provides better performance than the conventional approach of enhancing the transform domain scalar components independently. Performance improvement over the recently proposed GMM based time domain approach is also shown.

1. INTRODUCTION

Estimating a clean speech signal from noise corrupted speech is a challenging problem with applications in voice communication systems, automatic speech recognition systems, hearing aids, etc. Enhancement of a noisy speech signal is generally carried out using statistical models of clean speech and noise. Existing approaches include spectral subtraction [1], Wiener filtering [2], Bayesian estimation in the transform domain [3], hidden Markov model based methods [4], the subspace based approach [5], etc.
In this paper, we take the minimum mean square error (MMSE) estimation approach for speech enhancement (SE) in the DCT domain.

In the traditional transform based SE methods [3], [6], [7], [8], the ubiquitous Gaussian density is used for modeling the probability densities of transform domain speech and noise coefficients. However, it has been shown in the literature [9] that the probability density function (PDF) of the speech signal in the signal/transform domain is non-Gaussian in nature. In [10], a DCT domain speech enhancement method is proposed based on modeling the PDF of clean speech DCT coefficients using a Laplacian density. Among other non-Gaussian PDFs, the Gamma distribution (family of super-Gaussian densities) has been used in the DFT/KLT domain [11]-[14]. In our recent work [15], we also noted the importance of modeling the time domain speech coefficients using a non-Gaussian PDF; there, we modeled the joint PDF of time domain speech samples using a GMM. We note that the GMM has also been used earlier in speech enhancement to model the PDF of each short-time spectral component of speech [16], [17].

In the transform domain, the existing MMSE estimation based methods [3], [7], [8], [10], [13] enhance the transform domain coefficients of noisy speech individually, i.e., scalar processing is employed in the estimation process, assuming the coefficients are independent. This approach provides optimum performance if the respective joint PDFs of the clean speech vector and the noise vector can be effectively modeled using multivariate Gaussian densities (since, for jointly Gaussian vectors, a de-correlating transform makes the transform domain components independent). For signals with a non-Gaussian PDF, a linear transform can in general only de-correlate the components; it does not make them independent. Thus, the transform domain MMSE estimation method of enhancing scalar components independently leads to suboptimal performance for non-Gaussian signals, such as speech.
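The gap between de-correlation and independence can be illustrated on synthetic data. The sketch below is not taken from the paper's experiments: it assumes a two-dimensional Gaussian scale mixture as a stand-in for a pair of non-Gaussian transform coefficients, and the function name `dependency_after_decorrelation` is hypothetical. It compares the linear correlation between the two components with the correlation between their energies (squared values), the latter being one simple indicator of non-linear dependency.

```python
import numpy as np


def dependency_after_decorrelation(n_samples=200_000, seed=0):
    """Compare linear and energy correlation for Gaussian and non-Gaussian pairs.

    Synthetic illustration only (assumption: a Gaussian scale mixture stands in
    for non-Gaussian transform coefficients; no speech data is used).
    """
    rng = np.random.default_rng(seed)

    # Pair of independent, unit-variance Gaussian components.
    g = rng.standard_normal((n_samples, 2))

    # Shared random scale: both components of a sample are multiplied by the
    # same factor, so each marginal is non-Gaussian (Laplacian for an
    # exponentially distributed variance) while the pair stays uncorrelated.
    s = np.sqrt(rng.exponential(1.0, size=(n_samples, 1)))
    x = s * g

    results = {}
    for name, v in (("gaussian", g), ("scale_mixture", x)):
        # Second-order (linear) dependency between the two components.
        linear = np.corrcoef(v[:, 0], v[:, 1])[0, 1]
        # Dependency between component energies, invisible to de-correlation.
        energy = np.corrcoef(v[:, 0] ** 2, v[:, 1] ** 2)[0, 1]
        results[name] = (linear, energy)
    return results


if __name__ == "__main__":
    for name, (linear, energy) in dependency_after_decorrelation().items():
        print(f"{name}: linear corr = {linear:+.3f}, energy corr = {energy:+.3f}")
```

In this toy setting both pairs are uncorrelated, but only the Gaussian pair also has uncorrelated energies. For the scale mixture, the magnitude of one component carries information about the other, which an estimator that treats the components as independent scalars cannot use.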
To recover this performance loss, we investigate the approach of processing sub-vectors in the transform domain; the use of higher dimensional sub-vectors allows us to exploit the non-linear dependency, which is not possible with scalar processing. In the developed method, the noisy speech signal vector is transformed using the DCT and the DCT vector is split into sub-vectors; the sub-vectors are enhanced using the MMSE estimator. We have found that the new approach provides better performance than the conventional approach of enhancing the transform domain scalar components independently. The new method also shows significant performance improvement over the recently proposed GMM based time domain method [15].

2. PROPOSED METHOD

We consider a single-channel noisy speech signal as input to the speech enhancement system. Using the additive model of speech signal degradation in a noisy environment, the input noisy speech signal can be written as

$$y(n) = x(n) + w(n), \quad n = 0, 1, 2, \ldots, \qquad (1)$$

where $y(n)$, $x(n)$ and $w(n)$ are respectively the $n$th samples of the noisy speech signal, the clean speech signal and the additive noise. The speech enhancement system processes the sequence of noisy speech samples as overlapping frames, where each frame contains $K$ consecutive samples and successive frames are shifted by $R$ samples. We define the $t$th noisy speech vector as

$$\mathbf{y}(t) = [\, y_t(0) \;\; y_t(1) \;\; \ldots \;\; y_t(K-1) \,]^T, \quad t = 0, 1, \ldots,$$

where $y_t(n) = y(tR + n)$. Now, the noisy speech model of Eqn. (1) can be written in vector notation as

$$\mathbf{y}(t) = \mathbf{x}(t) + \mathbf{w}(t),$$

where $\mathbf{x}(t)$ and $\mathbf{w}(t)$ are the $K \times 1$ vectors of clean speech and noise, respectively, corresponding to the noisy observation vector $\mathbf{y}(t)$.

16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008, copyright by EURASIP

Denoting the $K \times K$ DCT matrix by $\mathbf{D}$, we define the noisy speech vector, clean speech vector and