AUTOMATIC SPEECH RECOGNITION BASED ON CEPSTRAL COEFFICIENTS AND A MEL-BASED DISCRETE ENERGY OPERATOR

Hesham Tolba, Douglas O'Shaughnessy

INRS-Télécommunications, Université du Québec
16 Place du Commerce, Verdun (Île-des-Soeurs), Québec, H3E 1H6, Canada
tolba, dougo @inrs-telecom.uquebec.ca

ABSTRACT

In this paper, a novel feature vector based on both Mel Frequency Cepstral Coefficients (MFCCs) and a Mel-based nonlinear Discrete-time Energy Operator (MDEO) is proposed to be used as the input of an HMM-based Automatic Continuous Speech Recognition (ACSR) system. Our goal is to improve the performance of such a recognizer using the new feature vector. Experiments show that the use of the new feature vector increases the recognition rate of the ACSR system. The HTK Hidden Markov Model Toolkit was used throughout. Experiments were done on both the TIMIT and NTIMIT databases. For the TIMIT database, when the MDEO was included in the feature vector to test a multi-speaker ACSR system, we found that the error rate decreased by about . On the other hand, for NTIMIT, the MDEO deteriorates the performance of the recognizer. That is, the new feature vector is useful for clean speech but not for telephone speech.

1. INTRODUCTION

In this paper, we introduce a novel combination of features to be used as the output of the front-end analyzer of an ACSR system. The new element that we combine with the MFCC coefficients is the Teager Energy. Teager Energy calculation is based on the fact that the speech signal can be modeled as the sum of AM-FM signals. This model represents each component of the speech signal as a signal with a combined amplitude modulation (AM) and frequency modulation (FM) structure. Hence, we can apply Teager's algorithm [1] for computing the energy of a signal. We use this algorithm as the basis for a new energy measure that replaces the traditional energy measure; it is used for a new time-frequency feature vector for speech recognition.
The nonlinear energy operator first developed by Kaiser [1] and its discrete-time counterpart have found several applications in the speech processing area [2],[3]. This discrete-time energy operator (DEO) is defined as $\Psi[x(n)] = x^2(n) - x(n-1)\,x(n+1)$. In [2] it has been shown that, when the energy operator is applied to an AM-FM signal, it can approximately estimate the squared product of the amplitude and frequency signals.

We apply the energy operator to the speech signal to obtain the new energy parameter of the feature vector for two reasons: the amplitude of a speech signal sample always depends on its frequency, and the traditional energy measure reflects only the amplitude of the signal, whereas the energy operator reflects the variations in both the amplitude and the frequency of the speech signal. This motivated us to include this element in the input feature vector of an automatic speech recognition system to enhance its performance.

This paper is organized as follows. The second section introduces the AM-FM modulation model, the DEO, spectral analysis, the cepstral coefficients and the MFCCs. The third section discusses how the MFCCs and the MDEO can be combined to form the input feature vector of an ACSR system. Experimental results that demonstrate the effectiveness of adding the MDEO to the feature vector are then presented. Finally, we conclude and discuss our present and future work.

2. BACKGROUND

2.1. AM-FM Modulation Model

Motivated by several nonlinear and time-varying phenomena during speech production, Maragos, Quatieri and Kaiser [2] proposed an AM-FM modulation model that represents each speech resonance (formant) as an AM-FM signal, that is, a signal with a combined amplitude modulation (AM) and frequency modulation (FM) structure.
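As a minimal sketch of the discrete-time energy operator applied to the simplest such signal, the Python fragment below evaluates $x^2(n) - x(n-1)\,x(n+1)$ on constant-amplitude, constant-frequency cosines. The function names `teager_energy` and `mean_square` and the test-tone parameters are illustrative assumptions and are not taken from the experiments reported here. For a tone $A\cos(\Omega n)$ the operator output equals $A^2\sin^2(\Omega)$, which is close to $A^2\Omega^2$ for small $\Omega$.

```python
import math

def teager_energy(x):
    """Discrete-time energy operator: x(n)^2 - x(n-1) * x(n+1).

    Returns one value per interior sample, n = 1 .. len(x) - 2.
    """
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def mean_square(x):
    """Conventional energy measure: average of the squared samples."""
    return sum(v * v for v in x) / len(x)

# Illustrative test tones (assumed values): same amplitude, different frequency.
A = 2.0
low = [A * math.cos(0.1 * n) for n in range(400)]
high = [A * math.cos(0.3 * n) for n in range(400)]

psi_low = teager_energy(low)
psi_high = teager_energy(high)

# For A*cos(omega*n) the operator output is constant: (A*sin(omega))^2.
print(psi_low[0], (A * math.sin(0.1)) ** 2)
print(psi_high[0], (A * math.sin(0.3)) ** 2)

# The conventional measure stays close to A^2 / 2 for both tones.
print(mean_square(low), mean_square(high))
```

In this sketch the operator output grows with the tone frequency while the mean squared amplitude barely changes, which illustrates why the operator reflects both amplitude and frequency.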
The speech signal is then modeled as the sum of such AM-FM signals, one for each formant, where each resonance has the form

$$r(t) = a(t)\cos\left(2\pi\left[f_c t + \int_0^t q(\tau)\,d\tau\right]\right) \qquad (1)$$

where $f_c$ is the center value of the formant frequency, $q(t)$ is the frequency modulating signal, and $a(t)$ is the time-varying amplitude. The instantaneous formant frequency signal is $f_i(t) = f_c + q(t)$. In the discrete-time domain the discrete-time AM-FM signal is defined as (2)