IEEE SIGNAL PROCESSING LETTERS, VOL. 17, NO. 11, NOVEMBER 2010

Data-Driven and Feedback Based Spectro-Temporal Features for Speech Recognition

G. S. V. S. Sivaram, Student Member, IEEE, Sridhar Krishna Nemala, Student Member, IEEE, Nima Mesgarani, and Hynek Hermansky, Fellow, IEEE

Abstract—This paper proposes novel data-driven and feedback based discriminative spectro-temporal filters for feature extraction in automatic speech recognition (ASR). Initially, a first set of spectro-temporal filters is designed to separate each phoneme from the rest of the phonemes. A hybrid Hidden Markov Model/Multilayer Perceptron (HMM/MLP) phoneme recognition system is trained on the features derived using these filters. As feedback to the feature extraction stage, the top confusions of this system are identified, and a second set of filters is designed specifically to address these confusions. Phoneme recognition experiments on TIMIT show that the features derived from the combined set of discriminative filters outperform conventional speech recognition features, and also contain significant complementary information.

Index Terms—Confusion analysis, discriminative filters, spectro-temporal features, speech recognition.

I. INTRODUCTION

It is well known that the information about speech sounds, such as phonemes, is encoded in the spectro-temporal dynamics of speech. Conventional automatic speech recognition (ASR) features encode either the spectral or the temporal variations of the spectro-temporal pattern. These features, typically extracted over a time scale of the order of a few hundred milliseconds, are transformed to posterior probability estimates of various phone classes in multilayer perceptron (MLP) based acoustic modeling.

Although the MLP effectively receives a context of several hundred milliseconds at its input layer, there has recently been an increased research effort in deriving features that explicitly capture the joint spectro-temporal dynamics of speech.
Such an approach is primarily motivated by the spectro-temporal receptive field (STRF) model for predicting the response of a cortical neuron to input speech, where the STRF describes the 2-D spectro-temporal pattern to which the neuron is most responsive [1].

Manuscript received January 14, 2010; revised August 24, 2010; accepted September 08, 2010. Date of publication September 27, 2010; date of current version October 04, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Pascale Fung. G. S. V. S. Sivaram and H. Hermansky are with the ECE Department, the Center for Language and Speech Processing, and the Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: sivaram@jhu.edu; hynek@jhu.edu). S. K. Nemala and N. Mesgarani are with the ECE Department and the Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: nemala@jhu.edu; nmesgara1@jhu.edu). Digital Object Identifier 10.1109/LSP.2010.2079930

Most works so far have used parametric 2-D Gabor filters for extracting features. The parameters of the Gabor functions are either selected using the data [2], [3] or preselected to form various streams of information [4]. Even though multiple spectro-temporal feature streams were formed and combined using MLPs in [4], it is difficult to interpret what each feature stream is trying to achieve.

Previously, we proposed feature extraction using a set of 2-D filters designed to discriminate each phoneme from the rest of the phonemes [5]. In other words, the 2-D filters used to extract features are learned from the data in a discriminative way. A hybrid Hidden Markov Model/Multilayer Perceptron (HMM/MLP) phoneme recognition system is trained on the features derived using these filters.
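For concreteness, the parametric 2-D Gabor filters used in the prior work cited above ([2]–[4]) take the form of a Gaussian envelope modulated by a sinusoidal carrier in the (time, frequency) plane. The sketch below is illustrative only; the filter size, modulation rates, and bandwidths are assumed values, not parameters reported in those papers (the filters proposed in this paper are learned from data, not parametric).

```python
import numpy as np

def spectro_temporal_gabor(n_frames=101, n_bands=15,
                           temp_mod_hz=4.0, spec_mod_cyc_per_band=0.25,
                           sigma_t=12.0, sigma_f=3.0, frame_rate_hz=100.0):
    """2-D Gabor filter over a spectro-temporal patch: a separable
    Gaussian envelope times a cosine carrier. All defaults are
    illustrative assumptions."""
    t = np.arange(n_frames) - n_frames // 2      # frame offsets
    f = np.arange(n_bands) - n_bands // 2        # critical-band offsets
    T, F = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-(T**2) / (2 * sigma_t**2) - (F**2) / (2 * sigma_f**2))
    carrier = np.cos(2 * np.pi * (temp_mod_hz * T / frame_rate_hz
                                  + spec_mod_cyc_per_band * F))
    g = envelope * carrier
    return g - g.mean()   # zero mean, so the filter ignores constant offsets

filt = spectro_temporal_gabor()
```

Sweeping `temp_mod_hz` and `spec_mod_cyc_per_band` over a grid is one way such filter banks are typically instantiated, with each (rate, scale) pair yielding one feature stream.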
In this paper, we introduce feedback to the feature extraction stage by analyzing the confusion matrix of the recognition system and designing an additional set of filters that are optimized to discriminate the most confused phoneme pairs. Phoneme recognition experiments on the TIMIT database show significant improvement when the features derived using this additional set of filters are combined at the posterior level (the output of the MLP) with those of the earlier system [5] using the Dempster-Shafer (DS) theory of evidence [6]. We also show that the proposed discriminative spectro-temporal features capture significant complementary information relative to both spectral (PLP [7]) and temporal (MRASTA [8]) features.

II. FEATURE EXTRACTION

Speech is represented in the spectro-temporal domain (log critical band energies) both for learning the 2-D filter shapes and for extracting the features. This representation is obtained by first performing a Short-Time Fourier Transform (STFT) on the speech signal with an analysis window of length 25 ms and a frame shift of 10 ms. The magnitude-squared values of the STFT output are then projected onto a set of frequency weights equally spaced on the Bark frequency scale to obtain the spectral energies in various critical bands. Finally, the spectro-temporal representation is obtained by applying the logarithm to these critical band energies. The block schematic of the proposed feature extraction is shown in Fig. 1 and described below.

A. Design of MLDA One-vs-Rest 2-D Filters

Labels for the TIMIT training data are obtained by mapping the 61 hand-labeled symbols to a standard set of 39 phonemes [9]. A set of spectro-temporal patterns corresponding to each phoneme is obtained from the spectro-temporal representation

1070-9908/$26.00 © 2010 IEEE
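The spectro-temporal representation described in Section II (25 ms windows, 10 ms shift, magnitude-squared STFT projected onto Bark-spaced band weights, then a logarithm) can be sketched as follows. This is a minimal sketch under stated assumptions: the FFT size, the number of bands, the triangular shape of the band weights, and the particular Bark-scale approximation are all choices not specified in the paper.

```python
import numpy as np

def hz_to_bark(hz):
    # One common Bark-scale approximation (several exist in the literature)
    return 6.0 * np.arcsinh(np.asarray(hz, dtype=float) / 600.0)

def log_critical_band_energies(signal, fs=16000, win_ms=25, shift_ms=10,
                               n_bands=15, n_fft=512):
    """Log critical-band energy spectrogram, shape (n_frames, n_bands)."""
    win = int(fs * win_ms / 1000)      # 25 ms analysis window
    hop = int(fs * shift_ms / 1000)    # 10 ms frame shift
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2

    # Triangular weights with centers equally spaced on the Bark scale
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    bark = hz_to_bark(freqs)
    edges = np.linspace(0.0, hz_to_bark(fs / 2), n_bands + 2)
    weights = np.zeros((n_bands, len(freqs)))
    for b in range(n_bands):
        lo, ctr, hi = edges[b], edges[b + 1], edges[b + 2]
        rising = (bark - lo) / (ctr - lo)
        falling = (hi - bark) / (hi - ctr)
        weights[b] = np.clip(np.minimum(rising, falling), 0.0, None)

    # Project onto band weights, then take the log (floored for stability)
    return np.log(power @ weights.T + 1e-10)

# Example: 1 s of a 440 Hz tone at 16 kHz
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
E = log_critical_band_energies(tone)
```

Both the filter learning and the feature extraction in the paper operate on patches cut from a representation of this kind, one patch per labeled phoneme instance.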