LOCALIZED SPECTRO-TEMPORAL CEPSTRAL ANALYSIS OF SPEECH

Jake Bouvrie, Tony Ezzat, and Tomaso Poggio

Center for Biological and Computational Learning
Massachusetts Institute of Technology, Cambridge, MA
jvb@mit.edu, tonebone@mit.edu, tp@ai.mit.edu

ABSTRACT

Drawing on recent progress in auditory neuroscience, we present a novel speech feature analysis technique based on localized spectro-temporal cepstral analysis of speech. We proceed by extracting localized 2D patches from the spectrogram and projecting them onto a 2D discrete cosine (2D-DCT) basis. For each time frame, a speech feature vector is then formed by concatenating low-order 2D-DCT coefficients from the set of corresponding patches. We argue that our framework has significant advantages over standard one-dimensional MFCC features. In particular, we find that our features are more robust to noise, and better capture temporal modulations important for recognizing plosive sounds. We evaluate the performance of the proposed features on a TIMIT classification task in clean, pink, and babble noise conditions, and show that our feature analysis outperforms traditional features based on MFCCs.

Index Terms— Speech processing, Speech recognition, Cepstral analysis, Nervous system

1. INTRODUCTION

Most state-of-the-art speech recognition systems today use some form of MEL-scale frequency cepstral coefficients (MFCCs) as their acoustic feature representation. MFCCs are computed in three major processing steps: first, a short-time Fourier transform (STFT) is computed from a time waveform. Then, over each spectral slice, a bank of triangular filters spaced according to the MEL-frequency scale is applied. Finally, a 1-D discrete cosine transform (1D-DCT) is applied to each filtered frame, and only the first N coefficients are kept. This process effectively retains only the smooth envelope profile of each spectral slice, reduces the dimensionality of each temporal frame, and decorrelates the features.

Although MFCCs have become a mainstay of ASR systems, machines still significantly under-perform humans in both noise-free and noisy conditions [13]. In the work presented here, we turn to recent studies of the mammalian auditory cortex [4, 16] in an attempt to bring machine performance towards that of humans via biologically-inspired feature analyses of speech. These neurophysiological studies reveal that cortical cells in the auditory pathway have two important properties which are distinctly not captured by standard MFCC features, and which we will explore in this work.

Firstly, rather than being tuned to purely spectral modulations, the receptive fields of cortical cells are instead tuned to both spectral and temporal modulations. In particular, auditory cells are tuned to modulations with long temporal extent, on the order of 50-200 ms [4, 16]. In contrast, MFCC features are tuned only to spectral modulations: each 1D-DCT basis may be viewed as a matched filter that responds strongly when the spectral slice it is applied to contains the spectral modulation encoded by the basis. MFCC coefficients thus indicate the degree to which certain spectral modulations are present in each spectral slice. The augmentation of MFCCs with Δ and ΔΔ features clearly incorporates more temporal information, but this is not equivalent to building a feature set with explicit tuning to particular temporal modulations (or joint spectro-temporal modulations, for that matter). Furthermore, the addition of Δ and ΔΔ features extends the temporal extent to only 30-50 ms, which is still far shorter than the duration of the temporal sensitivities found in cortical cells.

Secondly, the above neurophysiological studies further show that cortical cells are tuned to localized spectro-temporal patterns: the spectral span of auditory cortical neurons is typically 1-2 octaves [4, 16]. In contrast, MFCC features have a global frequency span, in the sense that the spectral modulation "templates" being matched to the slice span the entire frequency range. One immediate disadvantage of this global nature is reduced noise-robustness: the addition of noise in a small subband affects the entire representation.

Motivated by these findings, we propose a new speech feature representation which is localized in the time-frequency plane and is explicitly tuned to spectro-temporal modulations: we extract small overlapping 2D spectro-temporal patches from the spectrogram, project those patches onto a 2D discrete cosine basis, and retain only the low-order 2D-DCT coefficients. The 2D-DCT basis forms a biologically-plausible matched filter set with the explicit joint spectro-temporal tuning we seek. Furthermore, by localizing the representation of the spectral envelope, we develop a feature set that is robust to additive noise.
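For concreteness, the baseline MFCC pipeline described at the start of this section can be sketched as follows. This is a minimal illustrative sketch, not the implementation used in our experiments; the frame length, hop size, filterbank size, and the number of retained coefficients are placeholder values chosen for illustration.

# Minimal sketch of the standard MFCC pipeline (illustrative parameters).
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters with center frequencies uniformly spaced on the MEL scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mfcc(x, fs, frame_len=400, hop=160, n_filters=40, n_keep=13):
    # 1) Short-time Fourier transform: windowed frames, magnitude spectra.
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t * hop : t * hop + frame_len] * win
                       for t in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # 2) MEL-scale triangular filterbank applied to each spectral slice.
    fb = mel_filterbank(n_filters, frame_len, fs)
    logmel = np.log(spec @ fb.T + 1e-10)
    # 3) 1D-DCT over each slice; keeping only the first n_keep coefficients
    #    retains the smooth envelope and decorrelates the features.
    return dct(logmel, type=2, norm='ortho', axis=1)[:, :n_keep]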
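The proposed localized 2D-DCT analysis admits an equally compact sketch. The patch geometry, patch overlap, and number of retained low-order coefficients below are assumptions for illustration only (our actual settings are given with the experiments), and for simplicity this sketch assigns each feature vector to the column of patches beginning at a given frame, rather than to all patches overlapping the frame.

# Minimal sketch of the proposed localized 2D-DCT patch features
# (illustrative patch size, overlap, and coefficient counts).
import numpy as np
from scipy.fftpack import dct

def patch_dct_features(logspec, ph=16, pw=8, hop_f=8, hop_t=4, kf=3, kt=3):
    """logspec: (n_freq, n_time) log-magnitude spectrogram.
    ph, pw:  patch height (frequency bins) and width (time frames).
    hop_f/t: patch step in frequency / time, giving overlapping patches.
    kf, kt:  number of low-order DCT coefficients kept along each axis."""
    n_freq, n_time = logspec.shape
    feats = {}  # patch start frame -> list of low-order coefficient blocks
    for f0 in range(0, n_freq - ph + 1, hop_f):
        for t0 in range(0, n_time - pw + 1, hop_t):
            patch = logspec[f0:f0 + ph, t0:t0 + pw]
            # Separable 2D-DCT: 1D-DCT along frequency, then along time.
            c = dct(dct(patch, type=2, norm='ortho', axis=0),
                    type=2, norm='ortho', axis=1)
            # Keep only the low-order (smooth) spectro-temporal modulations.
            feats.setdefault(t0, []).append(c[:kf, :kt].ravel())
    # Concatenate, per time frame, the coefficients of the corresponding
    # column of localized patches into a single feature vector.
    return {t0: np.concatenate(v) for t0, v in feats.items()}

Because each patch spans only a small region of the time-frequency plane, noise added in a narrow subband perturbs only the coefficients of the patches covering that subband, in contrast to the global spectral span of MFCCs.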
2. BACKGROUND

A large number of researchers have recently explored novel speech feature representations in an effort to improve the performance of speech recognizers, but to the best of our knowledge none of these features have combined localization, sensitivity to spectro-temporal modulations, and low dimensionality.

Hermansky [7] and Bourlard [2] have used localized sub-band features for speech recognition, but their features were purely spectral and failed to capture temporal information. Subsequently, through their TRAP-TANDEM framework, Hermansky, Morgan and collaborators [7, 3] explored the use of long but thin temporal slices of critical-band energies for recognition; however, these features lack joint spectro-temporal sensitivity. Kajarekar et al. [8] found that spectral and temporal analyses performed sequentially outperformed joint spectro-temporal features within a linear discriminant framework; however, we have found joint 2D-DCT features to outperform combinations of purely spectral or temporal features. Atlas and Shamma [1] also explored temporal modulation sensitivity by computing a 1D-FFT of the critical-band energies from a spectrogram. These features too lack joint and localized spectro-temporal modulation sensitivity. Kitamura et al. [9] take a global 2D-FFT of a MEL-scale spectrogram, and discard