Exploring the Role of Spectral Smoothing in context of Children’s Speech Recognition
Shweta Ghai and Rohit Sinha
Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.
{shweta,rsinha}@iitg.ernet.in
Abstract
This work is motivated by our earlier study, which shows that explicit pitch normalization improves children’s speech recognition performance on adults’ speech trained models by reducing the pitch-dependent distortions in the spectral envelope. In this paper, we study the role of spectral smoothing in the context of children’s speech recognition. Spectral smoothing is effected in the feature domain by two approaches: modification of the bandwidth of the filters in the filterbank, and cepstral truncation. In conjunction, the two approaches give a significant improvement in children’s speech recognition performance, with a 57% relative improvement over the baseline. Also, when combined with the widely used vocal tract length normalization (VTLN), these spectral smoothing approaches yield an additional 25% relative improvement over the VTLN performance for children’s speech recognition on adults’ speech trained models.
Index Terms: children’s speech recognition, spectral smoothing, cepstral truncation
1. Introduction
The automatic recognition of children’s speech on adults’ speech trained models is a challenging problem and has received a lot of attention in the literature [1]-[10]. Various acoustic and linguistic correlates of speech, such as pitch, formant frequencies, average phone duration, speaking rate, pronunciation and grammar, have been cited as causes of the degradation in children’s speech recognition performance on adults’ speech trained models [1][4].
In the literature, various techniques have been applied to address the mismatch between children’s and adults’ speech, including vocal tract length normalization (VTLN) [2][3][5], speaker adaptation using maximum likelihood linear regression (MLLR) [2][5], speaker adaptive training [2][5], age-dependent acoustic models [6], language modeling [7][8], and pronunciation modeling [9]. The use of pitch reduction along with VTLN has also been shown to be effective for normalizing children’s speech for recognition with adults’ speech trained models [10].
Following the work in [10], we also found pitch variations to have a significant effect on children’s speech recognition performance, especially for high-pitch signals [11]. To explore this further, a detailed analysis of the possible cause of this pitch dependence was carried out. The study revealed that, as the pitch of the signals increases, the variances of the higher dimensions of the mel frequency cepstral coefficient (MFCC) features also increase significantly. On further analysis of the smoothed spectra corresponding to the MFCC features, the increased variances were attributed to the pitch-dependent distortions observed in the spectral envelope, particularly for high-pitch signals. As these distortions were noted predominantly below 1 kHz, it was hypothesized that they occur due to insufficient smoothing of the pitch harmonics by the lower-order filters of the filterbank, which have a bandwidth of around 100 Hz. On normalizing the pitch by explicitly modifying the pitch of the signals, a 15% relative improvement was obtained for children’s speech recognition. This work is reported in [12], which has been submitted as a companion paper.
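The link between the higher cepstral coefficients and fast ripple in the spectral envelope can be illustrated with a small sketch. It is not the analysis procedure of [12]: it assumes an orthonormal DCT-II as the cepstral transform, a synthetic 21-channel log filterbank vector made of a linear envelope plus an alternating ripple (standing in for partially resolved pitch harmonics), and illustrative truncation lengths. All function names are hypothetical.

```python
import math

def dct_basis(n):
    """Orthonormal DCT-II basis: rows are cepstral indices, columns are channels."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * k * (j + 0.5) / n)
                     for j in range(n)])
    return rows

def truncate_and_reconstruct(log_energies, n_keep):
    """Keep the first n_keep cepstral coefficients and map them back to the
    log filterbank domain, giving the smoothed spectrum they represent."""
    n = len(log_energies)
    basis = dct_basis(n)
    cepstra = [sum(b * x for b, x in zip(row, log_energies)) for row in basis]
    kept = cepstra[:n_keep]  # discard the higher-order coefficients
    return [sum(kept[k] * basis[k][j] for k in range(len(kept)))
            for j in range(n)]

def roughness(values):
    """RMS of the channel-to-channel difference, used as a ripple measure."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

if __name__ == "__main__":
    n_channels = 21  # assumed to match the filterbank of Section 2
    # Synthetic data (assumption): smooth envelope plus alternating ripple.
    log_energies = [10.0 - 0.2 * j + 0.5 * (-1) ** j for j in range(n_channels)]
    for n_keep in (21, 13, 8):  # illustrative truncation lengths
        smoothed = truncate_and_reconstruct(log_energies, n_keep)
        print(n_keep, round(roughness(smoothed), 3))
```

In this toy setting the alternating ripple is carried mainly by the highest-order coefficients, so dropping them lowers the roughness of the reconstructed spectrum while largely preserving the slow envelope.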
Motivated by our work on explicit pitch normalization [12], in this paper we examine the role of spectral smoothing in the context of children’s speech recognition. First, the bandwidth of the lower-order filters in the filterbank is modified so as to smooth out the pitch-dependent distortions observed in the low-frequency region of the spectral envelope. In addition, experiments with varying cepstral truncation are performed to reduce the influence on recognition performance of the increase in variances of the higher cepstral coefficients with increasing pitch. Truncation in turn implicitly smooths the corresponding spectra. Both methods demonstrate the effectiveness of spectral smoothing in improving children’s speech recognition. The effect of smoothing is also studied in conjunction with the widely used VTLN in the context of children’s speech recognition.
The rest of the paper is organized as follows. Section 2 presents the experimental setup and the database used in this work. The experimental results obtained with the filterbank-based and the cepstral-truncation-based spectral smoothing approaches are described in Section 3 and Section 4, respectively. The combination of spectral smoothing and VTLN is explored in Section 5. Finally, the paper is concluded in Section 6.
2. Experimental Setup and Databases
The connected digit recognizer used in this work is developed using the HTK toolkit. The 11 digits (0-9 and OH) are modeled as whole-word hidden Markov models (HMMs) with 16 states per word. Each state is modeled by a mixture of 5 diagonal-covariance Gaussian distributions, with simple left-to-right transitions and no skips over the states. The speech is analyzed with a Hamming window of length 25 ms, a frame rate of 100 Hz and a pre-emphasis factor of 0.97. A 21-channel Mel-filterbank is used for computing the MFCC features.
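As a rough illustration of how such a filterbank relates to the roughly 100 Hz bandwidth mentioned in the Introduction, the sketch below lists the edges of 21 triangular filters with centres equally spaced on the mel scale and counts the pitch harmonics falling inside each low-frequency filter. It assumes a sampling rate of 8 kHz (filters spanning 0-4000 Hz), the mel mapping 1127 ln(1 + f/700) commonly associated with HTK-style front ends, and two illustrative pitch values (120 Hz and 300 Hz). None of these values is stated in the text above, and the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Mel mapping assumed here: 1127 * ln(1 + f / 700)."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)

def mel_filter_edges(n_filters=21, f_low=0.0, f_high=4000.0):
    """Return (lower, centre, upper) edges in Hz of triangular filters whose
    centres are equally spaced on the mel scale between f_low and f_high."""
    m_low = hz_to_mel(f_low)
    step = (hz_to_mel(f_high) - m_low) / (n_filters + 1)
    points = [mel_to_hz(m_low + i * step) for i in range(n_filters + 2)]
    return [(points[i], points[i + 1], points[i + 2]) for i in range(n_filters)]

def harmonics_in_band(f0, lower, upper):
    """Count harmonics k * f0 (k >= 1) lying strictly inside (lower, upper)."""
    count = 0
    k = 1
    while k * f0 < upper:
        if k * f0 > lower:
            count += 1
        k += 1
    return count

if __name__ == "__main__":
    # Pitch values below are illustrative assumptions, not measured data.
    for index, (lower, centre, upper) in enumerate(mel_filter_edges(), start=1):
        if centre >= 1000.0:
            break  # only the region below 1 kHz is of interest here
        print(index, round(lower), round(upper),
              harmonics_in_band(120.0, lower, upper),
              harmonics_in_band(300.0, lower, upper))
```

Under these assumptions every filter below 1 kHz spans at least one harmonic of the 120 Hz pitch, whereas the lowest filters span none of the 300 Hz harmonics, which is consistent with the insufficient-smoothing hypothesis for high-pitch speech.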
This filterbank, as implemented in HTK, is referred to as the ‘default’ filterbank in this work. The feature vectors comprise 13 static MFCCs and their first
Copyright © 2009 ISCA. 6-10 September, Brighton, UK. 10.21437/Interspeech.2009-209