International Journal of Computer Applications (0975 – 8887) Volume 125 – No.6, September 2015

Digit Recognition based on Euclidean and DTW

Sreeja Nair, EXTC Department, FCRIT, Vashi-400703, Navi Mumbai, India, sreejan791@gmail.com
Milind Shah, EXTC Department, FCRIT, Vashi-400703, Navi Mumbai, India, milind05in@yahoo.co.in

ABSTRACT
This paper describes the implementation of two isolated digit recognition techniques and compares the algorithms implemented. Any digit recognition system comprises mainly two stages: feature extraction and similarity evaluation. Here, two feature extraction techniques, namely linear predictive cepstral coefficients (LPCC) and mel frequency cepstral coefficients (MFCC), are implemented, and the similarity evaluation is done using Euclidean distance and Dynamic Time Warping (DTW). In DTW, both single and averaged template matching are performed. The results obtained for these algorithms are examined and compared, and conclusions are drawn.

Keywords
Digit recognition, linear predictive cepstral coefficients, mel frequency cepstral coefficients, Euclidean distance, dynamic time warping.

1. INTRODUCTION
Speech recognition is the process by which a computer recognizes human speech and converts it into text. In particular, speech recognition for spoken digits finds a wide variety of applications, such as banking by voice, data input to a computer, and hands-free, eyes-free number dialing on mobile phones [1]. In practice, speech recognition algorithms are complex because of inter-speaker as well as intra-speaker variations. Inter-speaker variation is the difference in the same speech from person to person in terms of pronunciation, accent, etc., whereas intra-speaker variation is the difference between utterances of the same speech by the same person. The latter arises because humans can never produce a word exactly the same way twice [2]. Moreover, other factors such as slang and dialect are responsible for further variation of speech between speakers.
Speech recognition involves four steps, namely pre-processing, feature extraction, similarity evaluation and decision making [3]. Pre-processing prepares the signal for further processing; pre-emphasis, end-point detection, etc. are carried out in this stage. Feature extraction and similarity evaluation are the most important of these steps. Since speech is highly redundant, it is impractical to process, store and transmit the signal as it is. Hence a speech signal is represented by a small number of parameters.

Different parameter (feature) extraction techniques for speech recognition, such as LPC, LPCC, MFCC and PLP, have been implemented by various researchers. In [1], Rabiner presents an initial implementation of digit recognition using parameters such as LPC, log energy and zero crossing rate. Atal [4] used LPCC for speaker recognition and also introduced frame-wise averaging of the LPCC coefficients, which slightly increased the recognition accuracy; this averaging method is used in this paper. Similarly, MFCC based feature extraction has been carried out in [5], and speech recognition based on Perceptual Linear Prediction (PLP) and Euclidean distance has been implemented in [6].

Once the features are extracted for a given signal, they have to be compared with the features of the stored references, which depend on the vocabulary of the recognition system. Similarity evaluation can be done using template based techniques such as Euclidean distance and DTW, or network models such as Hidden Markov Models (HMM) and Neural Networks (NN). Two digit recognition techniques are implemented in this paper. In the first method, the two feature extraction techniques are implemented and the feature vectors are compared using Euclidean distance, whereas in the second method the features from the same extraction techniques are compared using DTW.
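The first method (frame-wise averaged features compared by Euclidean distance) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary of reference templates, and the choice to average each coefficient across all frames of an utterance are assumptions made for illustration only.

```python
import math

def average_frames(frames):
    """Average per-frame coefficient vectors into a single feature vector.

    `frames` is a list of equal-length lists, one per analysis frame
    (hypothetical layout, assumed for this sketch).
    """
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def recognize(test_frames, templates):
    """Return the label of the stored template closest to the test utterance.

    `templates` maps a digit label to its averaged reference vector
    (hypothetical structure, assumed for this sketch).
    """
    x = average_frames(test_frames)
    return min(templates, key=lambda label: euclidean(x, templates[label]))

# Toy usage with made-up two-dimensional "features".
templates = {"0": [0.0, 0.0], "1": [1.0, 1.0]}
frames = [[0.9, 1.1], [1.1, 0.9]]
decision = recognize(frames, templates)
```

Because averaging collapses the utterance to one vector, this comparison ignores timing differences between utterances, which is the limitation that DTW is meant to address.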
Section 2 describes the feature extraction techniques, whereas Section 3 gives details about the similarity evaluation techniques. Section 4 explains the implementation, and the results obtained are examined in Section 5.

2. FEATURE EXTRACTION TECHNIQUES
2.1 Linear Predictive Cepstral Coefficients (LPCC)
Linear prediction refers to predicting the present speech sample using the past samples. The predicted value is given by (1) [1,2]:

$$\hat{s}_n = \sum_{k=1}^{p} a_k \, s_{n-k} \tag{1}$$

where $a_k$ are the prediction coefficients and $s_{n-k}$ are the previous samples used to obtain the present sample $\hat{s}_n$. The prediction coefficients are obtained by minimizing the prediction error in the least squares sense using autocorrelation. The order, or the total number of prediction coefficients, is denoted by $p$. If $G$ is the gain of LPC, then the cepstral coefficients $C_m$ are obtained from the LPC parameters using (2), (3) and (4) [2]:

$$C_0 = \ln(G) \tag{2}$$

$$C_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m} \, C_k \, a_{m-k}, \quad 1 \le m \le p \tag{3}$$

$$C_m = \sum_{k=m-p}^{m-1} \frac{k}{m} \, C_k \, a_{m-k}, \quad m > p \tag{4}$$

The block diagram of LPCC computation [2] is shown in Fig.1.

Fig.1 The block diagram to find Linear Predictive Cepstral Coefficients [2]

The recorded speech signal is sampled, and end-point detection is performed to remove silence from the speech using both the short-time energy and the zero-crossing rate, as implemented