E. Bayro-Corrochano and J.-O. Eklundh (Eds.): CIARP 2009, LNCS 5856, pp. 297–304, 2009.
© Springer-Verlag Berlin Heidelberg 2009
Isolated Speech Recognition Based on Time-Frequency
Analysis Methods
Alfredo Mantilla-Caeiros1, Mariko Nakano-Miyatake2, and Hector Perez-Meana2

1 Instituto Tecnológico de Monterrey, Campus Ciudad de México,
Av. Del Puente, México D.F.
2 ESIME Culhuacan, Instituto Politécnico Nacional,
Av. Santa Ana 1000, 04430 México D.F., Mexico
amantill@itesm.mx, mariko@infinitum.com.mx, hmpm@prodigy.net.mx
Abstract. A feature extraction method for isolated speech recognition is proposed, based on a time-frequency analysis that uses a critical-band concept similar to that of the inner ear model; it emulates the inner ear behavior by performing a signal decomposition similar to that carried out by the basilar membrane. Evaluation results show that the proposed method performs better than other previously proposed feature extraction methods when used to characterize normal as well as esophageal speech signals.
Keywords: Feature extraction, inner ear model, isolated speech recognition,
time-frequency analysis.
1 Introduction
The performance of any speech recognition algorithm strongly depends on the accuracy of the feature extraction method; for this reason, several methods have been proposed in the literature to estimate a set of parameters that allows a robust characterization of the speech signal. A widely used feature extraction method consists in applying the Fast Fourier Transform (FFT) to the speech segment under analysis. This representation in the frequency domain is then mapped onto the well-known mel scale, in which frequencies below 1 kHz are analyzed on a linear scale while frequencies above 1 kHz are analyzed on a logarithmic scale, in analogy with the cochlea of the inner ear, which works as a frequency splitter [1]-[4].
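The mel mapping just described is commonly approximated by the formula mel(f) = 2595 log10(1 + f/700), which is roughly linear below 1 kHz and logarithmic above it. As a minimal sketch (the text does not state this formula explicitly, and the function names here are illustrative, not the authors' code):

```python
import math

def hz_to_mel(f_hz):
    """Standard mel-scale approximation: roughly linear below ~1 kHz,
    logarithmic above, mirroring the cochlea's frequency analysis."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, used e.g. to place filter-bank center frequencies."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Center frequencies of a 10-band filter bank up to 8 kHz,
# equally spaced on the mel scale rather than in Hz.
centers_hz = [mel_to_hz(hz_to_mel(8000.0) * k / 11.0) for k in range(1, 11)]
```

Note that by construction mel(1000) ≈ 1000, so the scale agrees with a linear frequency axis around 1 kHz and compresses the axis above it.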
Linear Predictive Coding (LPC) is another widely used feature extraction method, whose purpose is to find a set of parameters that allows an accurate representation of the speech signal as the output of an all-pole digital filter modeling the vocal tract. The filter excitation is an impulse sequence with a period equal to the pitch period of the speech signal under analysis when the speech segment is voiced, or white noise when the segment is unvoiced [1], [3]. Here, to estimate the feature vector, the speech signal is first divided into segments of 20 to 25 ms with 50% overlap. Finally, the linear predictive coefficients of each segment are