I.J. Intelligent Systems and Applications, 2018, 3, 22-32
Published Online March 2018 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijisa.2018.03.03
Copyright © 2018 MECS
Creation and Comparison of Language and
Acoustic Models Using Kaldi for Noisy and
Enhanced Speech Data
Thimmaraja Yadava G
Research Scholar, Department of Electronics and Communication Engineering
Siddaganga Institute of Technology, Tumakuru, Karnataka, India
E-mail: thimrajyadav@gmail.com
H S Jayanna
Professor, Department of Information Science and Engineering
Siddaganga Institute of Technology, Tumakuru, Karnataka, India
E-mail: jayannahs@gmail.com
Received: 17 June 2017; Accepted: 29 July 2017; Published: 08 March 2018
Abstract—In this work, the Language Models (LMs) and
Acoustic Models (AMs) are developed using the speech
recognition toolkit Kaldi for noisy and enhanced speech
data to build an Automatic Speech Recognition (ASR)
system for the Kannada language. The speech data used
for the development of the ASR models was collected in
an uncontrolled environment from farmers in different
dialect regions of Karnataka state. The collected speech
data is preprocessed with a proposed method for noise
elimination in degraded speech data. The proposed
method is a combination of Spectral Subtraction with
Voice Activity Detection (SS-VAD) and the Minimum Mean
Square Error-Spectrum Power Estimator based on Zero
Crossing (MMSE-SPZC). The word-level transcription and
validation of the speech data is done using the Indic
language transliteration tool (IT3 to UTF-8). The Indian
Language Speech Label (ILSL12) set is used for the
development of the Kannada phoneme set and lexicon. Of
the transcribed and validated speech data, 75% is used
for system training and 25% for testing. The LMs are
generated using Kannada language resources, and the
AMs are developed using Gaussian Mixture Models
(GMM) and Subspace Gaussian Mixture Models
(SGMM). The proposed method is studied in detail
and used for enhancing the degraded speech data. The
Word Error Rates (WERs) of the ASR models for noisy and
enhanced speech data are highlighted and discussed in
this work. The developed ASR models can be used in
a spoken query system to access real-time agricultural
commodity price and weather information in the Kannada
language.
Index Terms—Language Models (LMs), Acoustic
Models (AMs), Kaldi, Automatic Speech Recognition
(ASR), Word Error Rates (WERs).
I. INTRODUCTION
Speech is one of the most important forms of
communication among human beings. Communication
between human beings is successful only when there is
no distortion in the dialogue. Recognizing the words
uttered by a speaker is a challenging task, and it is
called speech recognition [1]. Speech enhancement
mainly depends on human perceptual factors and
signal processing applications. Speech data collected
in a real-time environment is noisy in nature. Normally,
speech is corrupted by several degradations such as
background noise, vocal noise, factory noise, F16 noise,
babble noise and reverberation. Noise reduction
in degraded speech data is a challenging task [2]. The
Spectral Subtraction (SS) method is one of the most widely
used methods for speech enhancement. This method is mainly
associated with Voice Activity Detection (VAD), which is
used to find the active regions of the degraded speech
signal [3]. The degraded speech signal is processed by
considering both low signal-to-noise ratio (SNR) and high
SNR regions. The degraded speech segments are processed
frame by frame, with a frame duration of 20 ms. The SS-VAD
method was proposed for speech enhancement in [4-8]. The
effect of noise in the degraded speech signal can be
eliminated by subtracting the average magnitude spectrum
of the noise model from the average magnitude spectrum of
the degraded speech signal.
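The frame-wise subtraction described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the energy-based VAD rule, the threshold factor `vad_factor`, the spectral floor `floor`, the Hann window and the 50% frame overlap are all assumptions made for illustration, and the function names are hypothetical. Only the 20 ms frame duration and the subtraction of an average noise magnitude spectrum are taken from the text.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, zero-padding the tail."""
    n_frames = 1 + max(0, int(np.ceil((len(x) - frame_len) / hop)))
    padded = np.zeros((n_frames - 1) * hop + frame_len)
    padded[:len(x)] = x
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return padded[idx]


def spectral_subtraction(x, fs, frame_ms=20, vad_factor=2.0, floor=0.01):
    """Magnitude spectral subtraction with a simple energy-based VAD.

    Returns the enhanced signal and a boolean speech/non-speech flag per frame.
    vad_factor and floor are illustrative assumptions, not values from the paper.
    """
    frame_len = int(fs * frame_ms / 1000)   # 20 ms frames, as in the text
    hop = frame_len // 2                    # assumed 50% overlap
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)

    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Assumed VAD rule: a frame is speech if its energy exceeds a multiple
    # of the 10th-percentile frame energy.
    energy = np.sum(frames ** 2, axis=1)
    is_speech = energy > vad_factor * np.percentile(energy, 10)

    # Noise model: average magnitude spectrum over the non-speech frames.
    noise_mag = mag[~is_speech].mean(axis=0)

    # Subtract the noise model, keeping a small spectral floor so that
    # no magnitude becomes negative.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)

    # Resynthesize with the noisy phase and overlap-add the frames.
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros((len(frames) - 1) * hop + frame_len)
    for i, frame in enumerate(clean):
        out[i * hop:i * hop + frame_len] += frame
    return out[:len(x)], is_speech
```

Calling `spectral_subtraction(noisy, fs)` on a 1-D array `noisy` sampled at `fs` Hz returns the enhanced signal together with the per-frame VAD decisions. The spectral floor is a common safeguard against negative magnitudes after subtraction; it is included here as a design assumption rather than as a detail of the SS-VAD method cited above.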
Magnitude Squared Spectrum (MSS) estimators of the
speech signal were proposed for noise reduction in
degraded speech signals in [9-11]. The MSS estimators, namely
Minimum Mean Square Error-Short Time Power
Spectrum (MMSE-SP), Minimum Mean Square Error-
Spectrum Power based on Zero Crossing (MMSE-SPZC)