Automatic Speech Recognition with an Adaptation Model Motivated by Auditory Processing

Marcus Holmberg, David Gelbart, Student Member, IEEE, and Werner Hemmert, Member, IEEE

Abstract— The Mel-Frequency Cepstral Coefficient (MFCC) and Perceptual Linear Prediction (PLP) feature extraction methods typically used for automatic speech recognition (ASR) employ several principles which have known counterparts in the cochlea and auditory nerve: frequency decomposition, mel- or bark-warping of the frequency axis, and compression of amplitudes. It seems natural to ask whether one can profitably employ a counterpart of the next physiological processing step, synaptic adaptation. We therefore incorporated a simplified model of short-term adaptation into MFCC feature extraction. We evaluated the resulting ASR performance on the AURORA 2 and AURORA 3 tasks, in comparison to ordinary MFCCs, MFCCs processed by RASTA, and MFCCs processed by cepstral mean subtraction (CMS), and both in comparison to and in combination with Wiener filtering. The results suggest that our approach offers a simple, causal robustness strategy which is competitive with RASTA, CMS and Wiener filtering, and which performs well in combination with Wiener filtering. Compared to the structurally related RASTA, our adaptation model provides superior performance on AURORA 2 and, if Wiener filtering is used prior to both approaches, on AURORA 3 as well.

Index Terms— Neural adaptation, speech recognition, noise robustness.
EDICS Category: 1-RECO

Werner Hemmert*, Infineon Technologies AG, Corporate Research CPR ST, Building 10-562, Otto-Hahn-Ring 6, 81730 Munich, Germany, Tel.: +49 (89) 234-53055, Fax: +49 (89) 234-9557068, werner.hemmert@infineon.com

Marcus Holmberg, Infineon Technologies AG, Corporate Research CPR ST, Building 10-562, Otto-Hahn-Ring 6, 81730 Munich, Germany, Tel.: +49 (89) 234-48682, Fax: +49 (89) 234-9554115, marcus.holmberg@infineon.com

David Gelbart, International Computer Science Institute (ICSI), 1947 Center Street, Suite 600, Berkeley, CA 94704-1198, USA, Tel.: +1 (604) 737-9898, Fax: +1 (604) 221-7250, gelbart@icsi.berkeley.edu

I. INTRODUCTION

The accuracy of human speech recognition motivates the application of information processing strategies found in the human auditory system to automatic speech recognition (ASR) [1], [2]. The most popular feature extraction methods for ASR, Mel-Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP), already employ several principles which have known counterparts in the cochlea and auditory nerve: frequency decomposition, mel- or bark-warping of the frequency axis, and compression of amplitudes. It therefore seems natural to consider the next processing step in the auditory periphery: synaptic adaptation in the auditory nerve.

Adaptation (also known as synaptic depression) is a principal mechanism of neuronal information processing and is ubiquitous in the brain [3], [4], [5]. It accentuates signal onsets: a high initial firing rate is followed by a lower sustained rate. Adaptation is strong in the auditory nerve, as has been described in a number of measurements [3], [6], [7], [8]. Models of adaptation, or techniques resembling adaptation, have been used successfully in ASR. Adaptation has apparent similarities with the popular RASTA [9] technique.
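The onset emphasis of adaptation can be illustrated with a toy divisive-adaptation sketch: the instantaneous input is divided by a slowly charging internal state, so a step input produces a large onset response that decays toward a modest sustained level. The function name, time constant, and floor value below are arbitrary choices for illustration only; this is not the simplified model derived later in this paper.

```python
import numpy as np

def adapt(env, frame_rate=100.0, tau=0.1, floor=1e-3):
    """Toy divisive adaptation of a non-negative envelope.

    The output jumps at signal onsets (large input, small internal
    state) and then decays toward a sustained level as the state
    charges up to track the input.
    """
    a = np.exp(-1.0 / (tau * frame_rate))  # one-pole smoothing coefficient
    state = floor
    out = np.empty(len(env), dtype=float)
    for n, x in enumerate(env):
        out[n] = x / (floor + state)       # divide by the adaptation state
        state = a * state + (1.0 - a) * x  # slowly track the input level
    return out
```

Unlike a strict band-pass filter, this sketch does not drive the response to a constant stimulus to zero: the sustained output settles near x / (floor + x), mirroring the sustained firing rate of the auditory nerve.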
RASTA processing of speech is a bandpass modulation filtering, operating on the logarithmic spectrum. But whereas RASTA processing completely suppresses DC modulation, the auditory nerve shows a sustained firing rate in response to continuous stimuli. Recovery from adaptation might be partially responsible for the temporal (forward) masking observed in psychoacoustic experiments [10]. Strope and Alwan [11] developed a model replicating psychoacoustic masking experiments, with which they demonstrated ASR performance improvements, especially in noisy conditions. Seneff [12] included adaptation in her model of the auditory periphery, which was found in [13] to perform better in additive noise than a mel filter bank. Perdigão and Sá [14] found the Seneff model to be susceptible to noise (in contrast to the finding in [13]), but found that a simplified model of synaptic adaptation generally improved recognition scores. Tchorz and Kollmeier [15] used an auditory model to evaluate various adaptation parameters on an ASR task. They reported higher recognition scores in additive noise for their model compared to mel-frequency cepstral coefficients (MFCC) and attributed this mainly to their joint adaptation/compression model.

Accumulated knowledge of the synaptic processes of inner hair cells (e.g. [7]) has led to the evolution of fairly precise models [16], [17]. In this work, we first review the physiological facts and illustrate the effects of synaptic adaptation using a detailed model of auditory processing in the inner ear (Section II). We next derive a simplified model of adaptation and integrate it into conventional MFCC feature extraction (Section III).
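For concreteness, the RASTA band-pass modulation filtering discussed above can be sketched in a few lines. The coefficients below are those of the standard RASTA transfer function H(z) = 0.1 z^4 (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1) from [9], applied causally (dropping the z^4 advance leaves a four-frame delay). This is an illustrative sketch, not the exact code used in our experiments.

```python
import numpy as np

# Numerator taps of the classic RASTA modulation filter; the single
# denominator pole (0.98) appears in the recursion below. The taps
# sum to zero, so the filter has zero gain at DC.
B = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])

def rasta_filter(log_spec):
    """Filter each spectral channel (column) along time (rows)."""
    x = np.atleast_2d(np.asarray(log_spec, dtype=float))
    out = np.zeros_like(x)
    for n in range(x.shape[0]):
        acc = np.zeros(x.shape[1])
        for k in range(len(B)):              # FIR (numerator) part
            if n - k >= 0:
                acc += B[k] * x[n - k]
        prev = out[n - 1] if n > 0 else 0.0  # IIR pole at z = 0.98
        out[n] = acc + 0.98 * prev
    return out
```

Because the numerator taps sum to zero, the response to a constant log-spectral input decays to zero over time; this is precisely the complete suppression of DC modulation contrasted above with the sustained firing of the auditory nerve.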
We evaluate the resulting ASR performance using the AURORA 2 and AURORA 3 speech recognition tasks (Section IV and Section V), in comparison to ordinary MFCCs, MFCCs processed by RASTA, and MFCCs processed by cepstral mean subtraction (CMS), and both in comparison to and in combination with Wiener filtering.