Single Channel Dereverberation Method in Log-Melspectral Domain Using Limited Stereo Data for Distant Speaker Identification

Aditya Arie Nugraha*, Kazumasa Yamamoto*†, Seiichi Nakagawa*
* Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi, Japan
† Department of Information and Computer Engineering, Toyota National College of Technology, Toyota, Japan
E-mail: {arie, kyama, nakagawa}@slp.cs.tut.ac.jp

Abstract: In this paper, we present a feature enhancement method that uses neural networks (NNs) to map a reverberant feature in the log-melspectral domain to its corresponding anechoic feature. The mapping is done by cascade NNs trained using the Cascade2 algorithm with an implementation of segment-based normalization. We assumed that the feature dimensions were independent of each other and experimented with several assumptions about the room transfer function for each dimension. A speaker identification system was used to evaluate the method. Using limited stereo data, we improved the identification rate on both simulated and real datasets. On the simulated dataset, the proposed method was effective in both noiseless and noisy reverberant environments with various noise and reverberation characteristics. On the real dataset, a configuration of 6 independent NNs for the 24-dimensional feature, trained on only 1 pair of utterances, gave a 35% average error reduction relative to the baseline, which employed cepstral mean normalization (CMN).

I. INTRODUCTION

The use of a distant-talking microphone in an automatic speech/speaker recognition (ASR) system can improve user convenience. However, the reverberant signal captured by such a microphone may degrade system performance. Several feature enhancement approaches have been proposed to deal with the reverberation problem, including the vector Taylor series (VTS) [1], the particle filter [2], and the Kalman filter [3].
Several methods assume that stereo training data can be acquired. In the context of distant speaker identification, stereo data are simultaneously recorded pairs of close-talking and distant-talking utterances. In [4], 13 multilayer perceptron (MLP) NNs were trained using stereo data to map the 13-dimensional reverberant cepstral feature to its corresponding anechoic feature, with one NN per feature dimension. The input of each NN was a sequence of cepstral feature coefficients from 9 consecutive frames, and the output was a single cepstral feature coefficient. For the noise problem, SPLICE is a feature enhancement approach that also needs stereo data [5]. It estimates the clean cepstral feature from the noisy feature using a Gaussian Mixture Model (GMM) of the noisy feature.

Several algorithms for distant text-independent speaker identification have been proposed, e.g., GMM, GMM-Universal Background Model (GMM-UBM), and Support Vector Machine (SVM) [6]. Several more robust features have also been proposed, e.g., modulation spectral features [7] and short segment cepstral coefficients (SSCC) [8].

In [9], we introduced a single-channel, non-linear-regression-based dereverberation method using a single NN for distant speaker identification. The NN was trained on stereo data to compensate for the reverberation effect by mapping the reverberant feature in the log-melspectral domain to its corresponding anechoic feature. The log-melspectral domain was used because it gave us a compressed representation of the mel-filterbank output, which was beneficial for the NN. According to [10], several feature enhancement approaches work better in the log-spectral domain than in the power spectral domain. The log-spectral domain also has a linear relation to the cepstral domain, in which the final features of many ASR systems are computed.

We use cascade NNs trained using the Cascade2 algorithm, which is a variation of the Cascade-Correlation (CasCor) algorithm [11].
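The linear relation between the log-melspectral and cepstral domains noted above can be illustrated with a small sketch. It assumes the common convention that cepstral coefficients are obtained by applying a discrete cosine transform (DCT-II) to the log-mel coefficients of a frame; the function name `dct_ii` and the 4-band example frames are hypothetical and are not taken from our system.

```python
import math

def dct_ii(log_mel):
    """Unnormalized DCT-II of one frame of log-mel coefficients."""
    n = len(log_mel)
    return [
        sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, x in enumerate(log_mel))
        for k in range(n)
    ]

# Hypothetical 4-band log-mel frames, for illustration only.
frame_a = [1.0, 2.0, 0.5, -1.0]
frame_b = [0.3, -0.7, 1.2, 0.4]

# Linearity: DCT(2a + 3b) equals 2*DCT(a) + 3*DCT(b).
mixed = dct_ii([2 * a + 3 * b for a, b in zip(frame_a, frame_b)])
combined = [2 * a + 3 * b for a, b in zip(dct_ii(frame_a), dct_ii(frame_b))]
max_diff = max(abs(m - c) for m, c in zip(mixed, combined))

# A constant offset across all bands only changes coefficient 0.
offset_cepstrum = dct_ii([1.0] * 4)
```

Because the transform is linear, an additive correction applied to the log-mel coefficients carries over to the cepstral coefficients as the transformed correction.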
Compared with an MLP, the CasCor family does not require the number of layers and hidden neurons to be decided before training. Cascade2 is used because it relies on error minimization instead of covariance maximization, which makes it suitable for our regression task.

In this paper, we extend the method to the use of multiple NNs by modifying our assumptions about the room transfer function for each dimension of the log-melspectral feature. Using limited stereo data, we also show how the different assumptions affect the performance of a distant speaker identification system that used MFCC-based speaker-specific GMMs as the speaker models [12]. We believe that the use of CasCor and the possibility of using limited stereo data increase the feasibility of our method.

II. REVERBERATION MODEL

The relation between the anechoic and reverberant signals in the log-melspectral domain should be represented as a non-linear model [3]. However, for simplicity, we defined it as

$$y_j(t) = \alpha_{j,0}\, s_j(t) + \sum_{n=1}^{N} \alpha_{j,n}\, s_j(t-n), \tag{1}$$

where $s_j(t)$ and $y_j(t)$ represent the log-melspectral coefficients of the anechoic and reverberant signals, respectively, for feature dimension $j$ and frame index $t$ [13]. The coefficients $\alpha_{j,0}, \alpha_{j,1}, \ldots, \alpha_{j,N}$ represent the room transfer function (RTF) for feature dimension $j$. Then, the estimated anechoic coefficient $\hat{s}_j(t)$ could be expressed as

$$\hat{s}_j(t) = \beta_{j,0}\, y_j(t) + \sum_{k=1}^{L} \beta_{j,k}\, y_j(t-k) + \varepsilon \approx \sum_{k=0}^{L} \beta_{j,k}\, y_j(t-k), \tag{2}$$