Feature Bandwidth Extension for Persian Conversational Telephone Speech Recognition

Mohammad Mohsen Goodarzi 1,2 , Farshad Almasganj 1,2 , Jahanshah Kabudian 1,3 , Yasser Shekofteh 1,2 , Iman Sarraf Rezaei 1
1 Research Center for Intelligent Signal Processing (RCISP), Tehran, Iran
2 Department of Biomedical Engineering, Amirkabir University of Technology, Tehran, Iran
3 Department of Computer Engineering, Razi University, Kermanshah, Iran
Emails: {mm.goodarzi, almas, y_shekofteh, imansarraf}@aut.ac.ir, kabudian@rcisp.ac.ir

Abstract-The goal of this paper is to configure a complete setup for continuous conversational telephony speech recognition in Persian. For this purpose, two common methods, the Gaussian Mixture Model (GMM) and the Neural Network (NN), together with a proposed hybrid GMM-NN method, are considered for estimating full-bandwidth features from band-limited features. The performance of these methods is evaluated with two feature types, one spectral and one cepstral: LFBE and MFCC. The effect of speaker gender on the estimation process is also investigated. Our results show that the best phoneme recognition accuracy is obtained when MFCC features are reconstructed using two gender-dependent neural networks. In this configuration, phoneme accuracy was about 1.6 % higher than the baseline. The tests were carried out on the TFarsDat corpus.

Keywords-conversational telephony speech recognition; feature bandwidth extension; Gaussian mixture model; neural network

I. INTRODUCTION

Although recent automatic speech recognition (ASR) systems perform well on full-bandwidth clean speech, their performance degrades significantly in band-limited tasks. This degradation is more pronounced for telephony speech because of the effect of the transmission channel as well as the low sampling rate. At the same time, a wide variety of ASR applications run over telephone connections, in which a machine is responsible for interacting with a customer.
The problem of low-bandwidth speech is not limited to ASR tasks; improving the audibility of the received speech signal is also a challenging problem. These problems have recently attracted more research on methods for extending the bandwidth of speech, both for recognition and for audibility applications. These studies take various approaches. The source-filter model is one of the most widely used methods for audibility applications. In this approach, the speech signal is broken down into two parts: the excitation and the spectral envelope. The high band is then reconstructed by extending both the excitation and the spectral envelope of the low band. The effect of each part is discussed in [1], which concludes that extending the spectral envelope contributes more to the final result than extending the excitation. Approaches that target the recognition task, on the other hand, try to extend the bandwidth in the feature domain. The methods used to reconstruct full-band features are almost the same as those used to reconstruct the high band of the spectral envelope. In this paper we therefore focus on feature-domain bandwidth extension methods, although these methods could also be used within a source-filter model to enhance the quality of narrowband speech.

The most common method for reconstructing high-band components is based on a GMM that models the joint probability distribution of narrowband and wideband features [2, 3]. In the feature domain, the GMM has been used to model LFBE (Logarithm of Filter Bank Energies) [2], MFCC [4] and LPC-based features [5]. To improve on the GMM, which is a static model, [5] proposes a dynamic model, the HMM, to model the joint distribution. Then, for the frames belonging to each state, a group of weight vectors is computed using a linear predictive (LP) method to derive the high-band features from the low-band ones. Besides these statistical approaches, different neural network (NN) structures, such as feed-forward and bidirectional networks [6], have been used to estimate the high band.
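The joint-GMM idea mentioned above can be illustrated with a minimal sketch. It assumes a GMM with full covariances has already been trained on stacked vectors z = [x; y], where x is the narrowband feature and y the wideband (or high-band) feature, and it uses the standard minimum mean-square-error (conditional mean) estimator E[y | x]. The function name `gmm_conditional_mean`, the argument layout and the use of NumPy are assumptions made for illustration; the estimators actually used in [2, 3] or in this paper may differ in detail.

```python
import numpy as np

def gmm_conditional_mean(x, weights, means, covs, dx):
    """Estimate E[y | x] under a joint GMM over z = [x; y] (illustrative sketch).

    x       : (dx,) narrowband feature vector
    weights : (K,) mixture weights
    means   : (K, dx + dy) joint mean vectors
    covs    : (K, dx + dy, dx + dy) joint full covariance matrices
    dx      : dimension of the narrowband part x
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)

    n_comp = len(weights)
    log_post = np.empty(n_comp)
    cond_means = []
    for k in range(n_comp):
        mu_x, mu_y = means[k, :dx], means[k, dx:]
        s_xx = covs[k, :dx, :dx]   # covariance of the narrowband part
        s_yx = covs[k, dx:, :dx]   # cross-covariance between y and x
        diff = x - mu_x
        sol = np.linalg.solve(s_xx, diff)  # s_xx^{-1} (x - mu_x)
        _, logdet = np.linalg.slogdet(s_xx)
        # Log of weight times the marginal Gaussian likelihood of x
        log_post[k] = np.log(weights[k]) - 0.5 * (
            diff @ sol + logdet + dx * np.log(2.0 * np.pi)
        )
        # Per-component linear regression from x to y
        cond_means.append(mu_y + s_yx @ sol)

    # Normalise in the log domain to obtain component posteriors p(k | x)
    log_post -= log_post.max()
    post = np.exp(log_post)
    post /= post.sum()
    # Posterior-weighted sum of the per-component conditional means
    return post @ np.array(cond_means)
```

Each mixture component contributes a linear regression from the narrowband to the wideband features, and the component posteriors given x blend these regressions. With a single component the estimator reduces to ordinary linear regression under a joint Gaussian.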
In this paper, we investigate two common approaches, GMM and NN, on a Persian continuous speech recognition task over telephone speech. We also propose a hybrid GMM-NN approach. We evaluate these methods on both LFBE and MFCC features and choose the best method for the task of Persian speech recognition. The next section describes the details of the GMM, NN and GMM-NN approaches for estimating full-bandwidth features. Section 3 presents our experimental results, and Section 4 gives concluding remarks and discussion.

II. ESTIMATING FULL BANDWIDTH FEATURES

In this paper we aim to define a complete setup for continuous conversational telephone speech recognition in Persian. For this purpose we have evaluated two common methods, GMM and NN, for estimating full-bandwidth features from band-limited features. As far as we know, the GMM has never been used to extend band-limited features for speech recognition in Persian. An NN was used for this purpose in [6] and achieved noticeable results, but these were still below the performance obtained when the recognition system is trained on telephony features. We also evaluate these methods with two feature types that are common in this field, i.e. LFBE and MFCC. Beside this