978-1-5090-2906-8/16/$31.00 ©2016 IEEE
Performance Comparison of MFCC Based Bangla
ASR System in Presence and Absence of Third
Differential Coefficients
Sudipto Debnath, Fatema-E-Jannat, Susmita Saha, Mohammad Tarik Aziz, Rifayet Hasan Sajol, and Md. Jakaria
Rahimi
Department of Electrical and Electronic Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
s.dnath91@gmail.com, f.jannat29@gmail.com, susmita.eee34@gmail.com, imran1496@gmail.com, rifayet.014@gmail.com,
mjrahimi@gmail.com
Abstract—Present Mel Frequency Cepstral Coefficient
(MFCC) based Bangla Automatic Speech Recognition (ASR)
systems are mostly implemented with delta and acceleration
coefficients. Combining the MFCCs and the log energy with their
delta and acceleration coefficients yields a feature vector of 39
dimensions per 10 ms frame. In this paper, our objective is to
observe the effect of third differential coefficients on the
performance of Bangla ASR, which has not yet been explored in
this field. To do so, we have appended 13 third differential
coefficients to the previous 39 coefficients to form a vector of 52
coefficients per 10 ms frame. We have observed the performance of
the Bangla ASR system
in the presence and absence of third differential coefficients using
a Hidden Markov Model (HMM) based tied-state triphone model.
To build the speech corpus, 100 sentences were uttered by
varying numbers of speakers in different phases, including both
male and female speakers aged 22 to 24. The Hidden Markov
Model Toolkit (HTK) has been used for the
comparative analysis. We have considered the Sentence
Correction Rate (SCR) as the performance indicator. From the
experiments, it has been observed that the MFCC-based systems
of 39 (MFCC39) and 52 (MFCC52) dimensions have average
SCRs of 98.89% and 98.94%, respectively. Therefore, our finding
is that a slight improvement is possible with the inclusion of third
differential coefficients when the sampling rate is as high as
44.1 kHz.
Keywords—MFCC39; MFCC52; Bangla; third differential
coefficients; HMM
I. INTRODUCTION
ASR is a rapidly emerging application of natural language
technology in which a computer or computerized device
recognizes spoken speech. Numerous experiments have been
carried out in different languages to develop and improve
speech recognition systems. Despite the advancement of speech
recognition technology around the world, the main focus of
ASR development has always been the English language.
Comparatively little research exists on Bangla ASR, although
Bangla is the seventh most spoken language in the world [1].
Therefore, we have attempted to improve the Bangla speech
recognizer.
We have developed an ASR system for Bangla that employs Mel
Frequency Cepstral Coefficients (MFCCs). Several techniques
are available for feature extraction. Among them, we have
chosen MFCC because it computes the features on the Mel-scale.
The Mel-scale relates to human hearing: the human ear
responds to changes in the frequency of sound in an
approximately logarithmic manner, and the Mel-scale is
logarithmic above 1 kHz. We have
observed the performance obtained with both acceleration
coefficients and third differential coefficients using an HMM-based
tied-state triphone model. HMM is a statistical approach that
derives stochastic models from known utterances and then
compares the probabilities that an unknown utterance was
generated by each model [2]. To achieve higher accuracy in
HMM-based phonetic segmentation, an acoustic model based on
tied-state triphones has been employed here, as it is the most
effective model for capturing the co-articulation effect. The
immediate left and right phonetic contexts have been considered.
HTK has been used with the help of MATLAB for extracting the
features, building the HMMs, and generating the results.
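As an illustration of how the 39- and 52-dimensional vectors relate, the following Python sketch stacks successive differentials onto 13 static coefficients (12 MFCCs plus log energy). It assumes the regression-style differential formula documented for HTK, with a window of two frames and the first and last frames replicated at the edges. The function names `delta` and `stack_features`, the window size, and the toy input are illustrative assumptions and are not taken from the experimental configuration described in this paper.

```python
def delta(frames, window=2):
    """Regression-based differential of a list of feature vectors.

    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2), k = 1..window.
    Frames beyond either end are replaced by the first or last frame.
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]   # replicate last frame
                behind = frames[max(t - k, 0)][d]      # replicate first frame
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def stack_features(static, order=3, window=2):
    """Append `order` successive differentials to each static vector."""
    streams = [static]
    for _ in range(order):
        # Each new stream is the differential of the previous one.
        streams.append(delta(streams[-1], window))
    # Concatenate all streams frame by frame.
    return [sum((s[t] for s in streams), []) for t in range(len(static))]


# Toy input (assumed): 10 frames of 13 static coefficients.
static = [[float(t + d) for d in range(13)] for t in range(10)]
mfcc39 = stack_features(static, order=2)  # static + delta + acceleration
mfcc52 = stack_features(static, order=3)  # ... + third differential
print(len(mfcc39[0]), len(mfcc52[0]))
```

With `order=2`, each frame carries 13 static, 13 delta, and 13 acceleration values; `order=3` appends the 13 third differential values, so the first 39 entries of each 52-dimensional vector are identical to the 39-dimensional one.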
We have organized this paper into six sections, including
this introduction. Section II presents a general review of
previous work in the field of ASR using MFCC with
acceleration coefficients. Section III briefly describes the
methodology. Section IV provides the experimental setup.
The experimental results are discussed in Section V. The
conclusion and an overview of future work are given in
Section VI.
II. LITERATURE REVIEW
Researchers have performed many experiments in the field
of ASR, using different approaches to obtain acceptable
speech recognition accuracy.