A Comparative Study of Wavelet Based Feature Extraction Techniques in Recognizing Isolated Spoken Words

Sonia Sunny, David Peter S., and K Poulose Jacob
Cochin University of Science and Technology, Kochi, India
Email: sonia.deepak@yahoo.co.in; davidpeter@cusat.ac.in; kpj@cusat.ac.in

Abstract—Speech is a natural mode of communication for people, and speech recognition is an active area of research due to its versatile applications. This paper presents a comparative study of wavelet-based feature extraction methods for recognizing isolated spoken words. Isolated words from Malayalam, one of the four major Dravidian languages of southern India, are chosen for recognition. This work covers two speech recognition methods. The first is a hybrid approach that combines Discrete Wavelet Transforms (DWT) with Artificial Neural Networks (ANN), and the second combines Wavelet Packet Decomposition (WPD) with Artificial Neural Networks. Features are extracted using DWT and WPD, while training, testing, and pattern recognition are performed using ANN. The proposed methods are implemented for 50 speakers uttering 20 isolated words each. The experimental results show the efficiency of these techniques in recognizing speech.

Index Terms—speech recognition, feature extraction, discrete wavelet transforms, wavelet packet decomposition, classification, artificial neural networks.

I. INTRODUCTION

Automatic speech recognition (ASR) is the process of converting an acoustic signal, captured by a microphone or a telephone, into text, and it is a broad area of research. Since speech is the primary means of communication between people, research in automatic speech recognition and speech synthesis by machine has attracted a great deal of attention over the past five decades [1].
The human vocal tract and articulators are biological organs with nonlinear properties, whose operation is affected by factors ranging from gender to upbringing to emotional state. As a result, speakers vary in accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed. Speech patterns can be further distorted by background noise and echoes, as well as by electrical characteristics [2]. All these sources of variability make speech recognition a very complex problem.

The performance of a speech recognition system is measurable, and the most widely used measure is the recognition accuracy. Many parameters affect the accuracy of a speech recognition system. Wavelets and wavelet packet analysis are found to be very effective in speech processing and signal processing applications. Speech processing and recognition are active areas of research because of their wide variety of applications, such as mobile applications, weather forecasting, agriculture, healthcare, automatic translation, robotics, video games, and transcription [3].

A speech recognition system generally performs some kind of classification or recognition based on speech features, which are usually obtained via time-frequency representations [4]. Speech recognition using a computer proceeds in several phases. In this work, the speech recognition process is divided into three modules, namely, creating the speech database, feature extraction, and classification. Among these stages, the front-end processing part, called feature extraction, is key, because better features improve the recognition rate. While designing a wavelet-based speech recognition system, the main issue is to choose the optimal wavelets and the appropriate number of decomposition levels. Likewise, the number of hidden layers and the number of hidden neurons chosen play an important role in obtaining good recognition accuracy using neural networks.
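To illustrate how the number of decomposition levels shapes the coefficients available as features, the following is a minimal sketch of a multi-level discrete wavelet decomposition. It assumes the Haar wavelet purely for simplicity; the wavelet family actually used is not stated in this part of the paper. The function names `haar_dwt_step` and `haar_wavedec` and the sample frame are hypothetical and serve only as an illustration.

```python
import math


def haar_dwt_step(signal):
    """One Haar DWT level: return (approximation, detail) coefficient lists."""
    samples = list(signal)
    if len(samples) % 2 != 0:
        # Assumption: pad odd-length input by repeating the last sample.
        samples.append(samples[-1])
    scale = 1.0 / math.sqrt(2.0)
    # Low-pass branch: scaled pairwise sums, downsampled by two.
    approx = [(samples[i] + samples[i + 1]) * scale
              for i in range(0, len(samples), 2)]
    # High-pass branch: scaled pairwise differences, downsampled by two.
    detail = [(samples[i] - samples[i + 1]) * scale
              for i in range(0, len(samples), 2)]
    return approx, detail


def haar_wavedec(signal, levels):
    """Multi-level decomposition, returned as [cA_n, cD_n, ..., cD_1]."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2:
            break  # nothing left to split
        # Only the approximation band is decomposed again at each level.
        approx, detail = haar_dwt_step(approx)
        details.append(detail)
    return [approx] + details[::-1]


if __name__ == "__main__":
    # Hypothetical 8-sample frame standing in for a speech segment.
    frame = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
    for levels in (1, 2, 3):
        coeffs = haar_wavedec(frame, levels)
        print(levels, [len(band) for band in coeffs])
```

Each additional level halves the length of the approximation band, so a deeper decomposition gives a coarser approximation and more detail bands. In the DWT only the approximation band is split again, whereas wavelet packet decomposition would also split the detail bands at every level.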
The outline of this study is as follows. The spoken isolated words database is explained in Section II. Section III reviews the theory of feature extraction, followed by the concepts of discrete wavelet transforms and wavelet packet decomposition used during this stage. Section IV describes the classification stage using artificial neural networks. Section V presents a detailed analysis of the experiments and the results obtained. Conclusions are given in the last section.

II. ISOLATED WORDS DATABASE FOR MALAYALAM

A database is created for the Malayalam language using 50 speakers, each uttering 20 words. Twenty male speakers and thirty female speakers were used to create the database. The speech samples are taken from speakers aged between 20 and 30. The samples stored in the database are recorded using a high-quality studio-recording microphone at a sampling rate of 8 kHz (band-limited to 4 kHz). Recognition has been made on these