International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 6, December 2018, pp. 5381~5388
ISSN: 2088-8708, DOI: 10.11591/ijece.v8i6.pp5381-5388
Journal homepage: http://iaescore.com/journals/index.php/IJECE

Convolutional Neural Network and Feature Transformation for Distant Speech Recognition

Hilman F. Pardede, Asri R. Yuliani, Rika Sustika
Research Center for Informatics, Indonesian Institute of Sciences, Indonesia

Article history: Received Jan 5, 2018; Revised Jul 27, 2018; Accepted Aug 7, 2018

ABSTRACT

In many applications, speech recognition must operate in conditions where there is some distance between the speakers and the microphones. This is called distant speech recognition (DSR). In this condition, speech recognition must deal with reverberation. Nowadays, deep learning technologies have become the main technologies for speech recognition. A Deep Neural Network (DNN) in hybrid with a Hidden Markov Model (HMM) is the commonly used architecture. However, this system is still not robust against reverberation. Previous studies used Convolutional Neural Networks (CNN), a variant of the neural network, to improve the robustness of speech recognition against noise. CNN has a pooling property that captures local correlations between neighboring dimensions of the features. With this property, CNN can be used as a feature learner that emphasizes information in neighboring frames. In this study, we use CNN to deal with reverberation. We also propose applying feature transformation techniques, linear discriminant analysis (LDA) and maximum likelihood linear transformation (MLLT), to mel frequency cepstral coefficients (MFCC) before feeding them to the CNN. We argue that transforming the features can produce more discriminative features for the CNN, and hence improve the robustness of speech recognition against reverberation.
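As a rough illustration of the feature-transformation step described above, the following minimal sketch splices context frames and applies an LDA projection with scikit-learn. It is not the paper's implementation: the data are synthetic, the dimensions and context width are hypothetical, and MLLT is omitted since scikit-learn provides no standard implementation of it.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_frames, n_mfcc, n_classes = 500, 13, 10          # hypothetical sizes
feats = rng.normal(size=(n_frames, n_mfcc))        # stand-in for real MFCC frames
labels = rng.integers(0, n_classes, size=n_frames) # stand-in frame-level state labels

# Splice +/-3 neighbouring frames, a common step before LDA in ASR front-ends
context = 3
padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
spliced = np.hstack([padded[i:i + n_frames] for i in range(2 * context + 1)])

# LDA projects the spliced features onto at most (n_classes - 1) discriminative dims
lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
transformed = lda.fit_transform(spliced, labels)
print(spliced.shape, transformed.shape)  # (500, 91) (500, 9)
```

The transformed frames would then be fed to the CNN in place of raw MFCCs; in practice, toolkits such as Kaldi estimate LDA and MLLT jointly from forced alignments rather than from random labels as here.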
Our evaluations on the Meeting Recorder Digits (MRD) subset of the Aurora-5 database confirm that the LDA and MLLT transformations improve the robustness of speech recognition, yielding a 20% relative error reduction compared to a standard DNN-based speech recognizer using the same number of hidden layers.

Keywords: CNN; Distant speech recognition; Feature transformation; LDA; MLLT; Reverberation

Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author: Hilman F. Pardede, Research Center for Informatics, Jl. Cisitu No. 21/154D Bandung, Indonesia. Email: hilm001@lipi.go.id

1. INTRODUCTION

Deep learning technologies have recently achieved huge success in acoustic modelling for automatic speech recognition (ASR) tasks [1]-[4], replacing conventional Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) systems [5], [6]. Currently, the Deep Neural Network (DNN) is the state-of-the-art architecture for speech recognition. The DNN provides posterior probabilities to the HMM based on a set of learned features. The hybrid HMM-DNN has been shown to outperform HMM-GMM models for ASR. Nowadays, more ASR applications are found in our daily activities: they have been implemented as virtual assistants in smartphones, home automation, meeting diarisation, and so on. For such applications, ASR must operate in conditions where there is some distance between the speakers and the microphones. This is called distant speech recognition (DSR). In such conditions, ASR systems are expected to be robust against noise and reverberation. However, the performance of DNN-HMM