LOCALIZED SPECTRO-TEMPORAL CEPSTRAL ANALYSIS OF SPEECH

Jake Bouvrie, Tony Ezzat, and Tomaso Poggio

Center for Biological and Computational Learning
Massachusetts Institute of Technology, Cambridge, MA
jvb@mit.edu, tonebone@mit.edu, tp@ai.mit.edu

ABSTRACT

Drawing on recent progress in auditory neuroscience, we present a novel speech feature analysis technique based on localized spectro-temporal cepstral analysis of speech. We proceed by extracting localized 2D patches from the spectrogram and projecting them onto a 2D discrete cosine (2D-DCT) basis. For each time frame, a speech feature vector is then formed by concatenating low-order 2D-DCT coefficients from the set of corresponding patches. We argue that our framework has significant advantages over standard one-dimensional MFCC features. In particular, we find that our features are more robust to noise, and better capture temporal modulations important for recognizing plosive sounds. We evaluate the performance of the proposed features on a TIMIT classification task in clean, pink, and babble noise conditions, and show that our feature analysis outperforms traditional features based on MFCCs.

Index Terms— Speech processing, Speech recognition, Cepstral analysis, Nervous system

1. INTRODUCTION

Most state-of-the-art speech recognition systems today use some form of MEL-scale frequency cepstral coefficients (MFCCs) as their acoustic feature representation. MFCCs are computed in three major processing steps: first, a short-time Fourier transform (STFT) is computed from a time waveform. Then, over each spectral slice, a bank of triangular filters spaced according to the MEL-frequency scale is applied. Finally, a 1-D discrete cosine transform (1D-DCT) is applied to each filtered frame, and only the first N coefficients are kept. This process effectively retains only the smooth envelope profile of each spectral slice, reduces the dimensionality of each temporal frame, and decorrelates the features.

Although MFCCs have become a mainstay of ASR systems, machines still significantly under-perform humans in both noise-free and noisy conditions [13]. In the work presented here, we turn to recent studies of the mammalian auditory cortex [4, 16] in an attempt to bring machine performance towards that of humans via biologically-inspired feature analyses of speech. These neurophysiological studies reveal that cortical cells in the auditory pathway have two important properties which are distinctly not captured by standard MFCC features, and which we will explore in this work.

Firstly, rather than being tuned to purely spectral modulations, the receptive fields of cortical cells are instead tuned to both spectral and temporal modulations. In particular, auditory cells are tuned to modulations with long temporal extent, on the order of 50-200 ms [4, 16]. In contrast, MFCC features are tuned only to spectral modulations: each 1D-DCT basis may be viewed as a matched filter that responds strongly when the spectral slice it is applied to contains the spectral modulation encoded by the basis. MFCC coefficients thus indicate the degree to which certain spectral modulations are present in each spectral slice. The augmentation of MFCCs with Δ and ΔΔ features clearly incorporates more temporal information, but this is not equivalent to building a feature set with explicit tuning to particular temporal modulations (or joint spectro-temporal modulations, for that matter). Furthermore, the addition of Δ and ΔΔ features extends the temporal extent to only 30-50 ms, which is still far shorter than the duration of the temporal sensitivities found in cortical cells.

Secondly, the above neurophysiological studies further show that cortical cells are tuned to localized spectro-temporal patterns: the spectral span of auditory cortical neurons is typically 1-2 octaves [4, 16]. In contrast, MFCC features have a global frequency span, in the sense that the spectral modulation "templates" being matched to the slice span the entire frequency range. One immediate disadvantage of this global nature is reduced noise-robustness: the addition of noise in a small subband affects the entire representation.

Motivated by these findings, we propose a new speech feature representation which is localized in the time-frequency plane and is explicitly tuned to spectro-temporal modulations: we extract small overlapping 2D spectro-temporal patches from the spectrogram, project those patches onto a 2D discrete cosine basis, and retain only the low-order 2D-DCT coefficients. The 2D-DCT basis forms a biologically-plausible matched filter set with the explicit joint spectro-temporal tuning we seek. Furthermore, by localizing the representation of the spectral envelope, we develop a feature set that is robust to additive noise.
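For concreteness, the baseline MFCC pipeline described at the start of this section can be sketched as follows. This is a minimal illustrative sketch, not the implementation used in our experiments; the frame length, hop size, filterbank size, and the number of retained coefficients are placeholder values chosen for illustration.

# Minimal sketch of the standard MFCC pipeline (illustrative parameters).
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters with center frequencies uniformly spaced on the MEL scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mfcc(x, fs, frame_len=400, hop=160, n_filters=40, n_keep=13):
    # 1) Short-time Fourier transform: windowed frames, magnitude spectra.
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t * hop : t * hop + frame_len] * win
                       for t in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # 2) MEL-scale triangular filterbank applied to each spectral slice.
    fb = mel_filterbank(n_filters, frame_len, fs)
    logmel = np.log(spec @ fb.T + 1e-10)
    # 3) 1D-DCT over each slice; keeping only the first n_keep coefficients
    #    retains the smooth envelope and decorrelates the features.
    return dct(logmel, type=2, norm='ortho', axis=1)[:, :n_keep]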
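The proposed localized 2D-DCT analysis admits an equally compact sketch. The patch geometry, patch overlap, and number of retained low-order coefficients below are assumptions for illustration only (our actual settings are given with the experiments), and for simplicity this sketch assigns each feature vector to the column of patches beginning at a given frame, rather than to all patches overlapping the frame.

# Minimal sketch of the proposed localized 2D-DCT patch features
# (illustrative patch size, overlap, and coefficient counts).
import numpy as np
from scipy.fftpack import dct

def patch_dct_features(logspec, ph=16, pw=8, hop_f=8, hop_t=4, kf=3, kt=3):
    """logspec: (n_freq, n_time) log-magnitude spectrogram.
    ph, pw:  patch height (frequency bins) and width (time frames).
    hop_f/t: patch step in frequency / time, giving overlapping patches.
    kf, kt:  number of low-order DCT coefficients kept along each axis."""
    n_freq, n_time = logspec.shape
    feats = {}  # patch start frame -> list of low-order coefficient blocks
    for f0 in range(0, n_freq - ph + 1, hop_f):
        for t0 in range(0, n_time - pw + 1, hop_t):
            patch = logspec[f0:f0 + ph, t0:t0 + pw]
            # Separable 2D-DCT: 1D-DCT along frequency, then along time.
            c = dct(dct(patch, type=2, norm='ortho', axis=0),
                    type=2, norm='ortho', axis=1)
            # Keep only the low-order (smooth) spectro-temporal modulations.
            feats.setdefault(t0, []).append(c[:kf, :kt].ravel())
    # Concatenate, per time frame, the coefficients of the corresponding
    # column of localized patches into a single feature vector.
    return {t0: np.concatenate(v) for t0, v in feats.items()}

Because each patch spans only a small region of the time-frequency plane, noise added in a narrow subband perturbs only the coefficients of the patches covering that subband, in contrast to the global spectral span of MFCCs.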
2. BACKGROUND

A large number of researchers have recently explored novel speech feature representations in an effort to improve the performance of speech recognizers, but to the best of our knowledge none of these features have combined localization, sensitivity to spectro-temporal modulations, and low dimensionality.

Hermansky [7] and Bourlard [2] have used localized sub-band features for speech recognition, but their features were purely spectral and failed to capture temporal information. Subsequently, through their TRAP-TANDEM framework, Hermansky, Morgan and collaborators [7, 3] explored the use of long but thin temporal slices of critical-band energies for recognition; however, these features lack joint spectro-temporal sensitivity. Kajarekar et al. [8] found that spectral and temporal analyses performed sequentially outperformed joint spectro-temporal features within a linear discriminant framework; however, we have found joint 2D-DCT features to outperform combinations of purely spectral or temporal features. Atlas and Shamma [1] also explored temporal modulation sensitivity by computing a 1D-FFT of the critical-band energies from a spectrogram. These features too lack joint and localized spectro-temporal modulation sensitivity. Kitamura et al. [9] take a global 2D-FFT of a MEL-scale spectrogram, and discard