IEEE SIGNAL PROCESSING LETTERS, VOL. 17, NO. 11, NOVEMBER 2010 957
Data-Driven and Feedback Based Spectro-Temporal
Features for Speech Recognition
G. S. V. S. Sivaram, Student Member, IEEE, Sridhar Krishna Nemala, Student Member, IEEE, Nima Mesgarani, and
Hynek Hermansky, Fellow, IEEE
Abstract—This paper proposes novel data-driven and feedback-based discriminative spectro-temporal filters for feature extraction in automatic speech recognition (ASR). Initially, a first set of spectro-temporal filters is designed to separate each phoneme from the rest of the phonemes. A hybrid Hidden Markov Model/Multilayer Perceptron (HMM/MLP) phoneme recognition system is trained on the features derived using these filters. As feedback to the feature extraction stage, the top confusions of this system are identified, and a second set of filters is designed specifically to address these confusions. Phoneme recognition experiments on TIMIT show that the features derived from the combined set of discriminative filters outperform conventional speech recognition features and also contain significant complementary information.
Index Terms—Confusion analysis, discriminative filters,
spectro-temporal features, speech recognition.
I. INTRODUCTION
It is well known that the information about speech sounds,
such as phonemes, is encoded in the spectro-temporal dy-
namics of speech. Conventional automatic speech recognition
(ASR) features encode either the spectral or the temporal vari-
ations of the spectro-temporal pattern. These features, typically
extracted over a time scale of the order of few hundred mil-
liseconds, are transformed to posterior probability estimates of
various phone classes in multilayer perceptron (MLP) based
acoustic modeling.
Though the MLP effectively receives a context of several hundred milliseconds at its input layer, there has recently been an increased research effort in deriving features that explicitly capture the joint spectro-temporal dynamics of speech. Such an approach is primarily motivated by the spectro-temporal receptive field (STRF) model for predicting the response of a cortical neuron to input speech, where the STRF describes the 2-D spectro-temporal pattern to which the neuron is most responsive [1].
Manuscript received January 14, 2010; revised August 24, 2010; accepted
September 08, 2010. Date of publication September 27, 2010; date of current
version October 04, 2010. The associate editor coordinating the review of this
manuscript and approving it for publication was Dr. Pascale Fung.
G. S. V. S. Sivaram and H. Hermansky are with the ECE Department, the Center for Language and Speech Processing, and the Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: sivaram@jhu.edu; hynek@jhu.edu).
S. K. Nemala and N. Mesgarani are with the ECE Department and the Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: nemala@jhu.edu; nmesgara1@jhu.edu).
Digital Object Identifier 10.1109/LSP.2010.2079930
Most approaches so far have used parametric 2-D Gabor filters for extracting features. The parameters of the Gabor functions are either selected using the data [2], [3] or preselected to form various streams of information [4]. Even though multiple spectro-temporal feature streams were formed and combined using MLPs in [4], it is difficult to interpret what each feature stream is trying to achieve.
Previously, we proposed a feature extraction scheme using a set of 2-D filters designed to discriminate each phoneme from the rest of the phonemes [5]. In other words, the 2-D filters used to extract features are learned from the data in a discriminative way. A hybrid Hidden Markov Model/Multilayer Perceptron (HMM/MLP) phoneme recognition system is trained on the features derived using these filters.
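As a concrete illustration of this filtering-based feature extraction, the sketch below applies a bank of 2-D filters to a (time, critical band) log-spectrogram. The random kernels, their sizes, and the choice of reading the response at the center band are placeholder assumptions for illustration; in the actual system the filter shapes are learned discriminatively, one per phoneme in a one-vs-rest design.

```python
import numpy as np
from scipy.signal import correlate2d

def extract_features(spectrogram, filters):
    """Filter a (time, band) log-spectrogram with a bank of 2-D filters.

    spectrogram: (T, B) array of log critical-band energies.
    filters:     list of 2-D kernels (random placeholders here; in the
                 paper these are learned one-vs-rest discriminative filters).
    Returns a (T, len(filters)) feature matrix, one trajectory per filter.
    """
    feats = []
    for h in filters:
        # 'same' keeps the output time axis aligned with the input frames.
        out = correlate2d(spectrogram, h, mode='same')
        # Read the response at the center band as this filter's feature
        # trajectory (a simplifying assumption for the sketch).
        feats.append(out[:, spectrogram.shape[1] // 2])
    return np.stack(feats, axis=1)
```

Each filter thus yields one feature value per 10 ms frame, and the per-filter trajectories are stacked into the feature vector fed to the MLP.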
In this paper, we introduce feedback to the feature extrac-
tion stage by analyzing the confusion matrix of the recogni-
tion system and designing an additional set of filters that are
optimized to discriminate the most confused phoneme pairs.
Phoneme recognition experiments on the TIMIT database show significant improvement when the features derived using this additional set of filters are combined at the posterior level (output of the MLP) with those of the earlier system [5] using the Dempster-Shafer (DS) theory of evidence [6]. We also show that the proposed discriminative spectro-temporal features capture significant complementary information to both the spectral (PLP [7]) and the temporal (MRASTA [8]) features.
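To illustrate the posterior-level combination: when all belief mass is placed on singleton phoneme classes (a simplifying assumption; the DS combination in [6] may allocate mass over composite sets as well), Dempster's rule reduces to an element-wise product of the two posterior vectors followed by renormalization:

```python
import numpy as np

def ds_combine(p1, p2, eps=1e-12):
    """Combine two posterior vectors with Dempster's rule, assuming all
    mass sits on singleton phoneme classes (an illustrative simplification
    of the DS-based combination cited in the text)."""
    m = p1 * p2
    # m.sum() is the non-conflicting mass; 1 - m.sum() is the conflict
    # discarded by Dempster's rule before renormalization.
    return m / max(m.sum(), eps)
```

When the two systems agree on a class, its combined posterior is reinforced; classes on which they disagree are suppressed by the renormalization.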
II. FEATURE EXTRACTION
Speech is represented in the spectro-temporal domain (log
critical band energies) for both learning the 2-D filter shapes
and extracting the features. This representation is obtained by
first performing a Short Time Fourier Transform (STFT) on the
speech signal with an analysis window of length 25 ms and
a frame shift of 10 ms. The magnitude square values of the
STFT output are then projected on a set of frequency weights
which are equally spaced on the Bark frequency scale to ob-
tain the spectral energies in various critical bands. Finally, the
spectro-temporal representation is obtained by applying the logarithm to these critical band energies. A block schematic of the proposed feature extraction is shown in Fig. 1 and is described below.
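The pipeline above (STFT with a 25 ms window and 10 ms shift, projection of the magnitude-squared spectrum onto Bark-spaced critical-band weights, then a logarithm) can be sketched as follows. The number of bands, FFT size, Hz-to-Bark formula, and triangular band shapes are illustrative assumptions not specified in the text:

```python
import numpy as np

def spectro_temporal_representation(signal, fs=16000, win_ms=25,
                                    shift_ms=10, n_bands=19, n_fft=512):
    """Log critical-band energies of a speech signal (illustrative sketch)."""
    win = int(fs * win_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    # Frame the signal (25 ms windows, 10 ms shift) and apply a Hamming window.
    n_frames = 1 + max(0, (len(signal) - win) // shift)
    frames = np.stack([signal[i * shift:i * shift + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)
    # Magnitude-squared STFT.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2        # (frames, bins)
    # Triangular weights equally spaced on the Bark scale (assumed design).
    hz = np.fft.rfftfreq(n_fft, 1.0 / fs)
    bark = 6.0 * np.arcsinh(hz / 600.0)                    # Hz -> Bark (one common formula)
    centers = np.linspace(bark[1], bark[-1], n_bands + 2)
    weights = np.zeros((n_bands, len(hz)))
    for b in range(n_bands):
        lo, c, hi = centers[b], centers[b + 1], centers[b + 2]
        weights[b] = np.clip(np.minimum((bark - lo) / (c - lo),
                                        (hi - bark) / (hi - c)), 0.0, None)
    # Project onto critical bands and take the logarithm.
    return np.log(power @ weights.T + 1e-10)               # (frames, bands)
```

The resulting (frames, bands) array is the spectro-temporal representation on which the 2-D filters are both learned and applied.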
A. Design of MLDA One-vs-Rest 2-D Filters
Labels for the TIMIT training data are obtained by mapping
61 hand-labeled symbols to a standard set of 39 phonemes
[9]. A set of spectro-temporal patterns corresponding to each
phoneme is obtained from the spectro-temporal representation
1070-9908/$26.00 © 2010 IEEE