E. Bayro-Corrochano and J.-O. Eklundh (Eds.): CIARP 2009, LNCS 5856, pp. 297–304, 2009.
© Springer-Verlag Berlin Heidelberg 2009
Isolated Speech Recognition Based on Time-Frequency
Analysis Methods
Alfredo Mantilla-Caeiros1, Mariko Nakano-Miyatake2, and Hector Perez-Meana2

1 Instituto Tecnológico de Monterrey, Campus Ciudad de México,
Av. Del Puente, México D.F.
2 ESIME Culhuacan, Instituto Politécnico Nacional,
Av. Santa Ana 1000, 04430 México D.F., Mexico
amantill@itesm.mx, mariko@infinitum.com.mx, hmpm@prodigy.net.mx
Abstract. A feature extraction method for isolated speech recognition is proposed, based on a time-frequency analysis that uses a critical-band concept similar to that of the inner ear model; it emulates the inner ear behavior by performing a signal decomposition similar to that carried out by the basilar membrane. Evaluation results show that the proposed method performs better than other previously proposed feature extraction methods when used to characterize normal as well as esophageal speech signals.
Keywords: Feature extraction, inner ear model, isolated speech recognition,
time-frequency analysis.
1 Introduction
The performance of any speech recognition algorithm strongly depends on the accuracy of the feature extraction method; for this reason, several methods have been proposed in the literature to estimate a set of parameters that allows a robust characterization of the speech signal. A widely used feature extraction method consists in applying the Fast Fourier Transform (FFT) to the speech segment under analysis. This representation in the frequency domain is then mapped onto the well-known mel scale, in which frequencies below 1 kHz are analyzed on a linear scale while frequencies above 1 kHz are analyzed on a logarithmic scale, in analogy with the cochlea of the inner ear, which works as a frequency splitter [1]-[4].
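The mel mapping just described is commonly approximated by the formula mel(f) = 2595 log10(1 + f/700), which is roughly linear below 1 kHz and logarithmic above it. As a minimal sketch (the text does not state this formula explicitly, and the function names here are illustrative, not the authors' code):

```python
import math

def hz_to_mel(f_hz):
    """Standard mel-scale approximation: roughly linear below ~1 kHz,
    logarithmic above, mirroring the cochlea's frequency analysis."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, used e.g. to place filter-bank center frequencies."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Center frequencies of a 10-band filter bank up to 8 kHz,
# equally spaced on the mel scale rather than in Hz.
centers_hz = [mel_to_hz(hz_to_mel(8000.0) * k / 11.0) for k in range(1, 11)]
```

Note that by construction mel(1000) ≈ 1000, so the scale agrees with a linear frequency axis around 1 kHz and compresses the axis above it.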
Linear Predictive Coding (LPC) is another widely used feature extraction method, whose purpose is to find a set of parameters that allows an accurate representation of the speech signal as the output of an all-pole digital filter modeling the vocal tract. The filter excitation is an impulse sequence with a period equal to the pitch period of the speech signal under analysis when the speech segment is voiced, or white noise when the segment is unvoiced [1], [3]. Here, to estimate the feature vector, the speech signal is first divided into segments of 20 to 25 ms with 50% overlap. Finally, the linear predictive coefficients of each segment are