Exploring the Role of Spectral Smoothing in context of Children’s Speech Recognition

Shweta Ghai and Rohit Sinha
Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.
{shweta,rsinha}@iitg.ernet.in

Abstract

This work is motivated by our earlier study, which shows that explicit pitch normalization improves children’s speech recognition performance on adults’ speech trained models by reducing the pitch-dependent distortions in the spectral envelope. In this paper, we study the role of spectral smoothing in the context of children’s speech recognition. Spectral smoothing is effected in the feature domain by two approaches, viz., modification of the bandwidth of the filters in the filterbank and cepstral truncation. In conjunction, the two approaches give a significant improvement in children’s speech recognition performance, with a 57% relative improvement over the baseline. Also, when combined with the widely used vocal tract length normalization (VTLN), these spectral smoothing approaches result in an additional 25% relative improvement over the VTLN performance for children’s speech recognition on adults’ speech trained models.

Index Terms: children’s speech recognition, spectral smoothing, cepstral truncation

1. Introduction

The automatic recognition of children’s speech on adults’ speech trained models is a challenging problem and has received a lot of attention in the literature [1]-[10]. Various acoustic and linguistic correlates of speech, such as pitch, formant frequencies, average phone duration, speaking rate, pronunciation and grammar, have been identified as causes of the degradation in children’s speech recognition performance on adults’ speech trained models [1][4].
In the literature, various techniques have been applied to address the mismatch between children’s and adults’ speech, including vocal tract length normalization (VTLN) [2][3][5], speaker adaptation using maximum likelihood linear regression (MLLR) [2][5], speaker adaptive training [2][5], age-dependent acoustic models [6], language modeling [7][8], and pronunciation modeling [9]. The use of pitch reduction along with VTLN has also been shown to be effective for normalizing children’s speech for recognition with adults’ speech trained models [10].

Following the work in [10], we also found pitch variations to have a significant effect on children’s speech recognition performance, especially for high pitch signals [11]. To explore this further, a detailed analysis of the possible cause of this pitch dependence of the recognition performance has been carried out. The study revealed that, with increasing pitch of the signals, the variances of the higher dimensions of the mel frequency cepstral coefficient (MFCC) features also increase significantly. On further analyzing the smooth spectra corresponding to the MFCC features, the increased variances have been attributed to the pitch-dependent distortions observed in the spectral envelope, particularly for high pitch signals. As these distortions were noted predominantly below 1 kHz, it has been hypothesized that they occur due to insufficient smoothing of the pitch harmonics by the lower order filters of the filterbank, which have a bandwidth of around 100 Hz. On normalizing the pitch by explicitly modifying the pitch of the signals, a 15% relative improvement has been obtained for children’s speech recognition. This work has been reported in [12], which has been submitted as a companion paper.
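The hypothesis above, that narrow low-order filters do not smooth widely spaced pitch harmonics, can be illustrated with a minimal sketch. It assumes the common mel mapping 2595*log10(1 + f/700), triangular filters equally spaced on the mel scale, and a 0-4000 Hz analysis band; the band, the two F0 values, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math

def hz_to_mel(f_hz):
    """Common mel mapping used in MFCC front ends: 2595*log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_edges(num_filters, f_low_hz, f_high_hz):
    """Return (lower, centre, upper) edges in Hz of triangular mel filters."""
    mel_lo, mel_hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (mel_hi - mel_lo) / (num_filters + 1)
    points = [mel_to_hz(mel_lo + i * step) for i in range(num_filters + 2)]
    return [(points[i], points[i + 1], points[i + 2]) for i in range(num_filters)]

def harmonics_under_filter(lo_hz, hi_hz, f0_hz):
    """Count pitch harmonics k*f0 (k >= 1) lying strictly inside (lo_hz, hi_hz)."""
    count = 0
    k = 1
    while k * f0_hz < hi_hz:
        if k * f0_hz > lo_hz:
            count += 1
        k += 1
    return count

# ASSUMPTION: 4 kHz upper band edge; the sampling rate is not stated in this excerpt.
filters = mel_filter_edges(num_filters=21, f_low_hz=0.0, f_high_hz=4000.0)
base_widths = [hi - lo for lo, _, hi in filters]

# ASSUMPTION: 100 Hz and 300 Hz are illustrative adult-like and child-like F0 values.
for f0 in (100.0, 300.0):
    counts = [harmonics_under_filter(lo, hi, f0) for lo, _, hi in filters[:8]]
    print(f"F0={f0:.0f} Hz, harmonics under filters 1-8: {counts}")
```

Under these assumptions, any filter whose base width is smaller than F0 can contain at most one harmonic, and some contain none, so neighbouring low-order filter outputs may follow individual harmonics instead of the spectral envelope.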
Motivated by our work [12] on explicit pitch normalization, in this paper we experiment with the role of spectral smoothing in the context of children’s speech recognition. Initially, the bandwidth of the lower order filters in the filterbank is modified so as to smooth out the pitch-dependent distortions observed in the low frequency region of the spectral envelope. In addition, experiments with varying cepstral truncation are performed so as to reduce the influence on recognition performance of the increase in variances of the higher cepstral coefficients with increasing pitch. Cepstral truncation in turn smooths the corresponding spectra implicitly. Both methods show the effectiveness of spectral smoothing in improving children’s speech recognition. Apart from this, the effect of smoothing is also studied in conjunction with the widely used VTLN in the context of children’s speech recognition.

The rest of the paper is organized as follows. Section 2 presents the experimental setup and the database used in this work. The experimental results obtained with the filterbank based and the cepstral truncation based spectral smoothing approaches are described in Section 3 and Section 4, respectively. Further, the combination of spectral smoothing and VTLN is explored in Section 5. Finally, the paper is concluded in Section 6.

2. Experimental Setup and Databases

The connected digit recognizer used in this work is developed using the HTK toolkit. The 11 digits (0-9 and OH) are modeled as whole word hidden Markov models (HMMs) using 16 states per word. Each state is a mixture of 5 diagonal covariance Gaussian distributions, with simple left-to-right transitions without any skips over the states. The speech is analyzed with a Hamming window of length 25 ms, a frame rate of 100 Hz and a pre-emphasis factor of 0.97. A 21-channel Mel-filterbank is used for computing the MFCC features.
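The implicit smoothing that cepstral truncation applies to the output of a 21-channel filterbank can be sketched as follows. This is a minimal illustration, not the HTK implementation: it uses an orthonormal DCT-II (HTK's cepstral scaling differs), a synthetic log filterbank vector, and an illustrative truncation length; `dct_ii`, `idct_ii`, `truncate_cepstrum`, `roughness` and `num_keep` are hypothetical names.

```python
import math

def dct_ii(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct_ii(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        s = math.sqrt(1.0 / n) * c[0]
        s += sum(math.sqrt(2.0 / n) * c[k] * math.cos(math.pi * k * (i + 0.5) / n)
                 for k in range(1, n))
        out.append(s)
    return out

def truncate_cepstrum(log_energies, num_keep):
    """Zero all cepstral coefficients from index num_keep on, then map back."""
    cep = dct_ii(log_energies)
    kept = cep[:num_keep] + [0.0] * (len(cep) - num_keep)
    return idct_ii(kept)

def roughness(v):
    """Sum of squared first differences; larger means a less smooth curve."""
    return sum((b - a) ** 2 for a, b in zip(v, v[1:]))

# SYNTHETIC DATA (assumption): a smooth envelope plus a channel-to-channel ripple
# standing in for harmonic-induced distortion across 21 filterbank channels.
num_channels = 21
log_fbank = [10.0 + 2.0 * math.sin(math.pi * i / 20.0) + 0.5 * (-1) ** i
             for i in range(num_channels)]

# ASSUMPTION: keeping 13 coefficients is an illustrative truncation length.
smoothed = truncate_cepstrum(log_fbank, num_keep=13)
print("roughness before:", roughness(log_fbank))
print("roughness after: ", roughness(smoothed))
```

Lowering `num_keep` discards more of the rapidly varying components of the log spectrum, which is the sense in which truncation acts as a smoother.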
This filterbank, as implemented in HTK, is referred to as the ‘default’ filterbank in this work. The feature vectors comprise 13 static MFCC and their first

Copyright 2009 ISCA. 6-10 September, Brighton UK. 10.21437/Interspeech.2009-209