International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 6, December 2018, pp. 5381~5388
ISSN: 2088-8708, DOI: 10.11591/ijece.v8i6.pp5381-5388
Journal homepage: http://iaescore.com/journals/index.php/IJECE
Convolutional Neural Network and Feature Transformation for
Distant Speech Recognition
Hilman F. Pardede, Asri R. Yuliani, Rika Sustika
Research Center for Informatics, Indonesian Institute of Sciences, Indonesia
Article Info

ABSTRACT
Article history:
Received Jan 5, 2018
Revised Jul 27, 2018
Accepted Aug 7, 2018
In many applications, speech recognition must operate in conditions where
there is some distance between the speakers and the microphones. This is
called distant speech recognition (DSR). In this condition, speech recognition
must deal with reverberation. Nowadays, deep learning technologies are
becoming the main technologies for speech recognition. A Deep Neural
Network (DNN) in hybrid with a Hidden Markov Model (HMM) is the
commonly used architecture. However, this system is still not robust against
reverberation. Previous studies have used Convolutional Neural Networks
(CNN), a variant of the neural network, to improve the robustness of speech
recognition against noise. CNN has a pooling property that captures
local correlations between neighboring dimensions in the features. With
this property, CNN can act as a feature learner that emphasizes the
information in neighboring frames. In this study, we use CNN to deal with
reverberation. We also propose to apply feature transformation techniques,
linear discriminant analysis (LDA) and maximum likelihood linear
transformation (MLLT), to mel-frequency cepstral coefficients (MFCC)
before feeding them to the CNN. We argue that transforming the features
produces more discriminative features for the CNN, and hence improves the
robustness of speech recognition against reverberation. Our evaluations on
the Meeting Recorder Digits (MRD) subset of the Aurora-5 database confirm
that the LDA and MLLT transformations improve the robustness of speech
recognition, yielding a 20% relative error reduction compared to a
standard DNN-based speech recognition system with the same number of
hidden layers.
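The front-end pipeline summarized above can be sketched as follows. This is an illustrative example, not the authors' implementation: in typical ASR front-ends, LDA and MLLT are applied to spliced MFCC frames (each frame concatenated with its left/right neighbors), and the two estimated matrices compose into a single linear projection. The function names, the context width, and the random stand-in matrix `A` are our own assumptions.

```python
import numpy as np

def splice_frames(mfcc, context=1):
    """Stack each frame with `context` left/right neighbors (edge frames repeated)."""
    n, d = mfcc.shape
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

def apply_lda_mllt(mfcc, transform, context=1):
    """Project spliced MFCCs with a learned transform (the LDA*MLLT composition)."""
    return splice_frames(mfcc, context) @ transform.T

# Toy example: 5 frames of 13-dim MFCCs, context +/-1 -> 39-dim spliced frames,
# projected to 20 dims by a random stand-in for an estimated LDA*MLLT matrix.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((5, 13))
A = rng.standard_normal((20, 39))           # hypothetical learned transform
features = apply_lda_mllt(mfcc, A)          # shape (5, 20), fed to the CNN
```

In an actual system, the projection matrix would be estimated from labeled training data rather than drawn at random; only the shape of the computation is shown here.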
Keyword:
CNN
Distant speech recognition
Feature transformation
LDA
MLLT
Reverberation
Copyright © 2018 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Hilman F. Pardede
Research Center for Informatics,
Jl. Cisitu No. 21/154D Bandung, Indonesia.
Email: hilm001@lipi.go.id
1. INTRODUCTION
Deep learning technologies have recently achieved huge success in acoustic modelling for
automatic speech recognition (ASR) tasks [1]-[4]. They have replaced conventional Hidden Markov
Model-Gaussian Mixture Model (HMM-GMM) systems [5], [6]. Currently, the Deep Neural Network (DNN) is the
state-of-the-art architecture for speech recognition. The DNN provides posterior probabilities to the HMM
based on a set of learned features. The hybrid HMM-DNN has been shown to achieve superior performance
compared to HMM-GMM models for ASR.
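The hybrid mechanism described above can be sketched briefly. This is an illustrative example under our own naming, not the authors' code: the DNN outputs per-frame state posteriors P(state | features), while the HMM decoder needs quantities proportional to likelihoods P(features | state), which are obtained by dividing the posteriors by the state priors (Bayes' rule, dropping the constant term).

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, state_priors, floor=1e-10):
    """Convert DNN state posteriors to scaled likelihoods for HMM decoding.

    posteriors:   (n_frames, n_states), each row summing to 1
    state_priors: (n_states,), prior probability of each HMM state
    """
    priors = np.maximum(state_priors, floor)   # floor avoids division by zero
    return posteriors / priors                 # proportional to P(features | state)

# Toy example with 2 frames and 3 HMM states (hypothetical numbers)
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.6, 0.3]])
priors = np.array([0.5, 0.3, 0.2])
scaled = posteriors_to_scaled_likelihoods(post, priors)
```

The scaled likelihoods replace the GMM emission probabilities in the HMM; the rest of the decoding machinery is unchanged.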
Nowadays, more ASR applications are found in our daily activities. They
have been implemented as virtual assistants in smart-phones, home automation, meeting diarisation, and so
on. For such applications, ASR must operate in conditions where there is some distance between the
speakers and the microphones. This is called distant speech recognition (DSR). In such conditions, ASR
systems are expected to be robust against noise and reverberation. However, the performance of DNN-HMM