International Journal of Computer Applications (0975 – 8887) Volume 15 – No. 8, February 2011

Speech Recognition by Wavelet Analysis

Nitin Trivedi, Asst. Prof., Vidya College of Engg., Meerut
Sachin Ahuja, Asst. Prof., Vidya College of Engg., Meerut
Dr. Vikesh Kumar, Director, Vidya College of Engg., Meerut
Raman Chadha, Asst. Prof., Vidya College of Engg., Meerut
Saurabh Singh, Asst. Prof., Vidya College of Engg., Meerut

ABSTRACT
In an effort to provide a more efficient representation of the speech signal, the application of wavelet analysis is considered. This research presents an effective and robust method for extracting features for speech processing. Based on the time-frequency multi-resolution property of the wavelet transform, the input speech signal is decomposed into various frequency channels. The major issues in the design of this wavelet-based speech recognition system are choosing optimal wavelets for speech signals, selecting the decomposition level in the DWT, and selecting the feature vectors from the wavelet coefficients. More specifically, automatic classification of various speech signals using the DWT is described and compared across different wavelets. Finally, a wavelet-based feature extraction system and its performance on an isolated word recognition problem are investigated. For the classification of the words, a three-layered feed-forward network is used.

General Terms
Dynamic Time Warping (DTW) Algorithm, Wavelet Transform (WT).

Keywords
Speech recognition, feature extraction, wavelet transform, Discrete Wavelet Transform (DWT).

1. INTRODUCTION
Speech recognition is the process of automatically extracting and determining the linguistic information conveyed by a speech signal using computers or electronic circuits. Automatic speech recognition methods, investigated for many years, have been principally aimed at realizing transcription and human-computer interaction systems.
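The pipeline the abstract describes, decomposing the speech signal into frequency channels with the DWT and forming a feature vector from the wavelet coefficients, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the simple Haar wavelet as a stand-in for whichever wavelet the system ultimately selects, and the log-energy of each subband is one common choice of feature derived from the coefficients.

```python
import math

def haar_dwt_step(signal):
    """One level of the Haar DWT: split a signal into an
    approximation (low-pass) half and a detail (high-pass) half."""
    approx, detail = [], []
    for i in range(0, len(signal) - 1, 2):
        a, b = signal[i], signal[i + 1]
        approx.append((a + b) / math.sqrt(2))
        detail.append((a - b) / math.sqrt(2))
    return approx, detail

def wavelet_features(signal, levels=3):
    """Decompose `signal` over `levels` octaves and return the
    log-energy of each detail subband, plus the final approximation
    band, as a compact feature vector."""
    features = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt_step(approx)
        energy = sum(d * d for d in detail)
        features.append(math.log(energy + 1e-12))  # small floor avoids log(0)
    features.append(math.log(sum(a * a for a in approx) + 1e-12))
    return features

# Example: a feature vector for one short synthetic "frame" of speech
frame = [math.sin(2 * math.pi * 0.1 * n) for n in range(64)]
print(wavelet_features(frame))  # levels + 1 values, one per subband
```

In a full system, vectors like these (computed per frame, for a chosen wavelet and decomposition depth) would form the input to the three-layered feed-forward classifier mentioned in the abstract.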
Since the first technical paper on speech recognition appeared, research in this field has intensified, and speech recognizers for communicating with machines through speech have recently been constructed, although they remain of only limited use. Automatic speech recognition (ASR) offers some of the following advantages:

- Speech input is easy to perform because it does not require a specialized skill, as typing or push-button operation does.
- Information can be input even when the user is moving or performing other activities involving the hands, legs, eyes, or ears.
- Since a microphone or telephone can be used as an input terminal, inputting information is economical, with remote input possible over existing telephone networks and the Internet.

However, the task of ASR is difficult because:

- A large amount of redundancy is present in the speech signal, which makes discriminating between the classes difficult.
- Temporal and frequency variability is present, such as intra-speaker variability in the pronunciation of words and phonemes, as well as inter-speaker variability, e.g. the effect of regional dialects.
- The pronunciation of phonemes is context dependent (co-articulation).
- The signal is degraded by additive and convolutive noise present in the background or in the channel.
- The signal is distorted by non-ideal channel characteristics.

2. SPEECH RECOGNITION
Most speech recognition systems can be classified according to the following categories:

2.1 Speaker Dependent vs. Speaker Independent
A speaker-dependent speech recognition system is one that is trained to recognize the speech of only one speaker. Such systems are custom built for just a single person, and are hence not commercially viable. Conversely, a speaker-independent system is one that is trained to recognize the speech of any speaker. Such independence is hard to achieve, as speech recognition systems tend to become attuned to the speakers they are trained on, resulting in error rates that are higher than those of speaker-dependent systems.

2.2 Isolated vs.
Continuous
In isolated speech, the speaker pauses momentarily between every word, while in continuous speech the speaker speaks in a continuous and possibly long stream, with little or no break in between. Isolated speech recognition systems are easy to build, as it is trivial to determine where one word ends and another starts, and each word tends to be more cleanly and clearly spoken. Words spoken in continuous speech, on the other hand, are subject to the co-articulation effect, in which the pronunciation of a word is modified by the words surrounding it. This makes training a speech system difficult, as there may be many inconsistent pronunciations of the same word.

2.3 Keyword-based vs. Sub-word-unit based
A speech recognition system can be trained to recognize whole words, like "dog" or "cat". This is useful in applications