A class-specific speech enhancement for phoneme recognition: a dictionary learning approach

Nazreen P. M.¹, A. G. Ramakrishnan¹, Prasanta Kumar Ghosh²
¹ Medical Intelligence and Language Engineering (MILE) Laboratory
² Signal Processing Interpretation and Representation (SPIRE) Laboratory
Electrical Engineering, Indian Institute of Science (IISc), Bangalore, India
{nazreenpm, ramkiag, prasantg}@ee.iisc.ernet.in

Abstract

We study the influence of using class-specific dictionaries for enhancement, compared with a class-independent dictionary, on phoneme recognition of noisy speech. We hypothesize that class-specific dictionaries remove more noise than a class-independent dictionary, thereby resulting in better phoneme recognition. Experiments are performed with speech data from the TIMIT corpus and noise samples from the NOISEX-92 database. Using KSVD, four types of dictionaries have been learned: class-independent, manner-of-articulation-class, place-of-articulation-class and 39 phoneme-class. Initially, a set of labels is obtained by recognizing the speech enhanced using a class-independent dictionary. Using these approximate labels, the corresponding class-specific dictionaries are used to enhance each frame of the original noisy speech, and this enhanced speech is then recognized. Compared to the results obtained using the class-independent dictionary, the 39 phoneme-class based dictionaries provide relative phoneme recognition accuracy improvements of 5.5%, 3.7%, 2.4% and 2.2% for factory2, m109, leopard and babble noises, respectively, when averaged over 0, 5 and 10 dB SNRs.

Index Terms: speech enhancement, robust speech recognition, sparse coding, dictionary learning, phoneme recognition.

1. Introduction

In the past decade, there have been tremendous improvements in the field of automatic speech recognition (ASR).
Despite these advances, the performance of an ASR system degrades significantly in the presence of noise due to the mismatch between the training and test environments, for example, when training is done on clean speech and testing is performed on noisy speech. The presence of noise distorts the spectrum of speech and hence degrades the performance. Several techniques have been proposed to address this problem and improve recognition performance in noisy environments. One such method is to employ model adaptation schemes, like parallel model combination [1] and HMM adaptation [2, 3, 4]. Another approach is to analyze the existing features and enhance them to make them more noise robust, as in cepstral mean subtraction [5], RASTA filtering [6] and vector Taylor series [7]. A third approach is to enhance the speech as a front-end processing step, using methods such as spectral subtraction [8] or Wiener filtering [9], before it is fed into a recognizer. This obviates the need to retrain the ASR system for different types of noisy inputs, since the same ASR trained on clean speech can be used. A comparative study [10] has also been reported on the performance of ASR systems with various enhancement approaches. Recently, sparse coding techniques have gained popularity. A speech enhancement scheme based on sparse coding has been proposed by Sigg et al. [11], who show that it performs better than techniques like geometric spectral subtraction [12]. Several exemplar-based techniques [13, 14] have also been proposed in the past for robust speech recognition.

In sparse coding, the basic assumption is that structured signals like speech can be represented as sparse linear combinations of prototype vectors, or bases. The speech signal is composed of several sounds, which can be categorized in various ways, such as by manner of articulation (MOA) [15], place of articulation (POA) [16, 17] or phonemes (PHN).
Some of these classes might correlate with certain noise types more than other classes do. Hence, the bases in a dictionary learned from these classes may represent noise power to varying degrees and consequently result in poor speech reconstruction. By removing the contribution from the bases of those classes that correlate well with noise, one could improve the enhancement performance. One way to achieve this is to learn different dictionaries for different classes and intelligently select a particular dictionary for each segment. Raj et al. [18] propose a similar approach, where they use phoneme-dependent non-negative matrix factorization (NMF) for separating music from speech. In this work, we extend their idea to sparse coding to analyze how, using class-specific dictionaries, the performance of an ASR system could be improved over that obtained using a dictionary learned in a class-independent manner. Wang et al. [19] investigated the use of class-specific ideal ratio mask estimation for speech enhancement; however, both the recognizer and the mask estimator are trained using noisy speech. In contrast, we consider a more realistic scenario where the noise level is not known a priori and a recognizer trained on clean speech is used.

2. Enhancement using learned dictionary

Under the additive model, noisy speech can be represented as

$y_t(m) = s_t(m) + n_t(m)$   (1)

where $y_t(m)$, $s_t(m)$ and $n_t(m)$ are the $m$-th samples of the time-domain noisy speech, clean speech and noise, respectively. Considering the short-time Fourier transform (STFT),

$y(\omega_k) = s(\omega_k) + n(\omega_k)$   (2)

where $\omega_k = \frac{2\pi k}{R}$, $k = 0, 1, \ldots, R-1$, $R$ is the number of frequency bins and $k$ is the bin index. Taking the magnitude STFT, the noisy speech can be approximated as $\mathbf{y} \approx \mathbf{s} + \mathbf{n} \in \mathbb{R}^{R \times 1}$, where $\mathbf{s}$ and $\mathbf{n}$ represent the magnitude spectra of the clean speech and the noise, respectively.
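The magnitude-domain additivity assumed above is only an approximation, since phase is discarded. A minimal numerical sketch, using a synthetic tone as a stand-in for a speech frame and white noise as the noise (both toy choices, not from the paper's data), illustrates how close the approximation is for one frame:

```python
import numpy as np

def magnitude_spectrum(frame, R=256):
    """Magnitude of the R-point DFT of one Hann-windowed frame."""
    return np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=R))

rng = np.random.default_rng(0)
m = np.arange(256)
s = np.sin(2 * np.pi * 0.05 * m)        # toy "clean speech" frame
n = 0.3 * rng.standard_normal(256)      # toy additive noise frame
y = s + n                               # Eq. (1): exact in the time domain

# Eq. (2) is exact for complex STFTs; for magnitudes it only holds
# approximately, because |S + N| <= |S| + |N| (phases are ignored).
Y, S, N = (magnitude_spectrum(x) for x in (y, s, n))
err_additive = np.linalg.norm(Y - (S + N)) / np.linalg.norm(Y)
print(err_additive)
```

By the triangle inequality the magnitude sum always over-estimates the noisy magnitude, but when one source dominates each frequency bin the relative error stays small, which is what makes the magnitude-domain model workable.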
An estimate of the magnitude STFT of the noisy speech is given by

$\hat{\mathbf{y}} = \mathbf{D}_s \mathbf{c}_s + \mathbf{D}_n \mathbf{c}_n$   (3)

Copyright 2016 ISCA. INTERSPEECH 2016, September 8–12, 2016, San Francisco, USA. http://dx.doi.org/10.21437/Interspeech.2016-236
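The decomposition in Eq. (3) can be sketched with a minimal orthogonal matching pursuit (OMP): the noisy spectrum is coded over the concatenated dictionary $[\mathbf{D}_s\ \mathbf{D}_n]$, and the enhanced spectrum is obtained by keeping only the speech-dictionary contribution $\mathbf{D}_s \mathbf{c}_s$. Here toy orthonormal dictionaries and a synthetic sparse mixture stand in for the KSVD-learned dictionaries and real magnitude spectra (which are non-negative and far more coherent); this is an illustrative assumption, not the paper's actual setup.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Minimal orthogonal matching pursuit: find sparse c with y ≈ D c."""
    support, residual = [], y.copy()
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    c = np.zeros(D.shape[1])
    c[support] = coeffs
    return c

rng = np.random.default_rng(7)
R, Ks, Kn = 64, 32, 32                        # frequency bins, atoms per dictionary
Q, _ = np.linalg.qr(rng.standard_normal((R, R)))
Ds, Dn = Q[:, :Ks], Q[:, Ks:]                 # toy stand-ins for learned D_s, D_n

cs_true = np.zeros(Ks); cs_true[[2, 11, 25]] = [1.5, 0.8, 1.1]
cn_true = np.zeros(Kn); cn_true[[4, 19]] = [0.9, 0.6]
s, n = Ds @ cs_true, Dn @ cn_true
y = s + n                                     # noisy magnitude spectrum, per Eq. (3)

c = omp(np.hstack([Ds, Dn]), y, n_nonzero=5)
cs, cn = c[:Ks], c[Ks:]
s_hat = Ds @ cs                               # enhanced spectrum: noise term dropped
```

Because the toy dictionary is orthonormal, OMP recovers the sparse code exactly and `s_hat` matches the clean spectrum; with learned, coherent dictionaries the split between the speech and noise parts is only approximate, which is precisely why a class-specific $\mathbf{D}_s$ that overlaps less with the noise can help.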