978-1-5090-2906-8/16/$31.00 ©2016 IEEE

Performance Comparison of MFCC Based Bangla ASR System in Presence and Absence of Third Differential Coefficients

Sudipto Debnath, Fatema-E-Jannat, Susmita Saha, Mohammad Tarik Aziz, Rifayet Hasan Sajol, and Md. Jakaria Rahimi
Department of Electrical and Electronic Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
s.dnath91@gmail.com, f.jannat29@gmail.com, susmita.eee34@gmail.com, imran1496@gmail.com, rifayet.014@gmail.com, mjrahimi@gmail.com

Abstract—Present Mel Frequency Cepstral Coefficient (MFCC) based Bangla Automatic Speech Recognition (ASR) systems are mostly implemented with delta and acceleration coefficients. With the delta and acceleration coefficients of the MFCCs and the log energy, a 39-dimensional feature vector is obtained every 10 ms. In this paper, our objective is to observe the effect of third differential coefficients on the performance of Bangla ASR, which has not yet been explored in this field. To do so, we have appended 13 third differential coefficients to the previous 39 coefficients, giving a feature vector of 52 coefficients per 10 ms frame. We have observed the performance of the Bangla ASR system in the presence and absence of third differential coefficients using a Hidden Markov Model (HMM) based tied-state triphone model. To build the speech corpus, 100 sentences have been uttered by different numbers of speakers in different phases; the speakers include both males and females of similar age, between 22 and 24. The Hidden Markov Model Toolkit (HTK) has been used for the comparative analysis. We have considered the Sentence Correction Rate (SCR) as the performance indicator. From the experiments, it has been observed that the MFCC based systems of 39 (MFCC39) and 52 (MFCC52) dimensions have average SCRs of 98.89% and 98.94%, respectively. Therefore, our finding is that a slight improvement is possible with the inclusion of third differential coefficients when the sampling rate is as high as 44.1 kHz.
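The feature layout described in the abstract (13 static coefficients, extended with delta, acceleration, and third differential coefficients to 39 and then 52 dimensions) can be illustrated with a minimal sketch. It uses a common regression formula for differential features, d_t = Σ_k k (c_{t+k} − c_{t−k}) / (2 Σ_k k²). The regression window of 2, the edge padding, the function names `delta` and `stack_features`, and the random data standing in for real MFCCs are all assumptions made for illustration; the exact HTK configuration used in the experiments is not given here.

```python
import numpy as np


def delta(features, window=2):
    """Regression-based differential of a (frames, dims) array.

    The first and last frames are repeated so that every frame has
    `window` neighbours on both sides (an assumed edge policy).
    """
    num_frames = features.shape[0]
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    out = np.zeros_like(features, dtype=float)
    for k in range(1, window + 1):
        ahead = padded[window + k : window + k + num_frames]
        behind = padded[window - k : window - k + num_frames]
        out += k * (ahead - behind)
    return out / denom


def stack_features(static, order):
    """Append `order` successive differentials to the static features."""
    blocks = [np.asarray(static, dtype=float)]
    for _ in range(order):
        # Each new block is the differential of the previous block.
        blocks.append(delta(blocks[-1]))
    return np.hstack(blocks)


# Placeholder data: 100 frames of 13 static coefficients
# (assumed here to be 12 MFCCs plus log energy).
rng = np.random.default_rng(0)
static = rng.standard_normal((100, 13))

mfcc39 = stack_features(static, order=2)  # static + delta + acceleration
mfcc52 = stack_features(static, order=3)  # ... + third differential
print(mfcc39.shape, mfcc52.shape)  # (100, 39) (100, 52)
```

In this sketch the first 39 columns of `mfcc52` are identical to `mfcc39`, so the two feature sets differ only in the 13 appended third differential coefficients.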
Keywords—MFCC39; MFCC52; Bangla; third differential coefficients; HMM

I. INTRODUCTION

ASR is a rapidly emerging application of natural language technology in which a computer or computerized device recognizes spoken speech. Numerous experiments have been carried out in different languages to develop and improve speech recognition systems. Despite the advancement of speech recognition technology around the world, the center of attention in ASR development has always been the English language. There is only a small amount of research on Bangla ASR, although Bangla is the seventh most spoken language in the world [1]. Therefore, we have attempted to improve the Bangla speech recognizer.

We have developed an ASR system for Bangla that employs Mel Frequency Cepstral Coefficients (MFCCs). Several techniques are used for feature extraction. Among them, we have chosen MFCC because it computes the features on the Mel scale. The Mel scale relates to human hearing: the human ear responds to sound in a roughly logarithmic manner, and the Mel scale is approximately logarithmic above 1 kHz. We have observed the performance with acceleration coefficients and with third differential coefficients using an HMM-based tied-state triphone model. HMM is a statistical approach that derives stochastic models from known utterances and compares the probabilities that an unknown utterance was generated by each model [2]. To achieve higher accuracy in HMM-based phonetic segmentation, an acoustic model based on tied-state triphones has been employed, as it is the most effective model for capturing the co-articulation effect. The immediate left and right phonetic contexts have been considered. HTK has been used, with the help of MATLAB, for extracting the features, building the HMMs, and generating the results.

We have organized this paper into six sections, including this introduction.
Section II presents a general review of some previous work in the field of ASR using MFCC with acceleration coefficients. In Section III, the methodology is briefly described. Section IV provides the experimental setup. The experimental results are discussed in Section V. The conclusion and an overview of future work are given in Section VI.

II. LITERATURE REVIEW

Researchers have performed many experiments in the field of ASR, using different approaches, to obtain acceptable speech recognition accuracy.