Towards predicting consonant confusions of degraded speech

O. Ghitza¹, D. Messing², L. Delhorne², L. Braida², E. Bruckert¹, M. Sondhi³

¹ Sensimetrics Corporation, Somerville, Massachusetts, USA
² Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
³ Avaya Research Laboratory, Basking Ridge, New Jersey, USA

1 Introduction

The work described here arose from the need to understand and predict speech confusions caused by acoustic interference. Current predictors of speech intelligibility are inadequate for making such predictions, even for normal-hearing listeners. The Articulation Index and related measures (STI, SII) are geared to predicting speech intelligibility, but they predict only average intelligibility, not error patterns, and they make predictions for only a limited set of acoustic conditions (linear filtering, reverberation, additive noise). We aim to predict the consonant confusions made by normal-hearing listeners presented with degraded speech. Our prediction engine comprises an efferent-inspired peripheral auditory model (PAM) connected to a template-match circuit (TMC) based on basic concepts of neural processing. The extent to which this engine is an accurate model of auditory perception will be measured by its ability to predict consonant confusions in the presence of noise. Our approach involves two separate steps: first, we tune the parameters of the PAM in isolation from the TMC; we then freeze the resulting PAM and use it to tune the parameters of the TMC. In Section 2 we describe a closed-loop model of the auditory periphery that combines a nonlinear model of the cochlea (Goldstein, 1990) with efferent-inspired feedback. To adjust the parameters of the PAM with minimal interference from the TMC, we use confusion patterns for speech segments generated in a paradigm with minimal cognitive load, the Diagnostic Rhyme Test (Voiers, 1983).
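The closed-loop idea can be illustrated with a toy negative-feedback gain loop: a slow efferent-like signal adapts the gain applied ahead of a cochlear channel so that the smoothed output energy tracks a fixed operating point, loosely mimicking medial-olivocochlear gain reduction in noise. This is only a conceptual sketch, not the authors' PAM; the function name, the target, and the constants `alpha` and `eta` are all illustrative assumptions.

```python
import numpy as np

def efferent_gain_loop(frame_energies, target=1.0, alpha=0.9, eta=0.1):
    """Toy efferent feedback loop (illustrative, not the paper's model).

    A per-channel gain is slowly adapted so that the smoothed output
    energy approaches a target operating point: when sustained input
    (e.g. background noise) drives the output above the target, the
    gain is turned down, restoring the channel's operating range.
    """
    gain, smoothed, out = 1.0, target, []
    for e in frame_energies:
        y = gain * e                                # feed-forward path
        smoothed = alpha * smoothed + (1 - alpha) * y   # slow efferent estimate
        # negative feedback: shrink gain when the smoothed output runs hot
        gain *= np.exp(-eta * np.log(max(smoothed, 1e-12) / target))
        out.append(y)
    return np.array(out), gain

# Sustained loud input drives the gain below 1, compressing the output.
out, final_gain = efferent_gain_loop([4.0] * 200)
```

With a constant input energy of 4.0, the gain decays until the smoothed output settles near the target, so later frames are attenuated relative to the first.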
To further reduce PAM-TMC interaction we synthesized DRT word pairs, restricting stimulus differences to the initial diphones. In Section 3 we describe initial steps in a study aimed at predicting confusions of naturally spoken diphones, i.e., tokens that inherently exhibit phonemic variability. We describe a TMC inspired by principles of cortical neural processing (Hopfield, 2004). A desirable property of the circuit is insensitivity to time-scale variations of the input stimuli; we demonstrate this property in the context of the DRT consonant discrimination task.
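The kind of time-scale insensitivity described above can be illustrated with a minimal sketch: if a token is encoded by the *relative* timing of its feature events (each event time expressed as a fraction of the token's total span), a uniform time stretch or compression leaves the code unchanged, so a stretched stimulus still matches its template. This is a hedged illustration of the principle only, not Hopfield's (2004) circuit or the paper's TMC; both function names are hypothetical, and it assumes stimulus and template contain the same number of events.

```python
import numpy as np

def relative_timing(event_times):
    """Map absolute event times onto fractions of the total span.

    A uniform time-scale change (t -> c*t) cancels in the ratio,
    so this code is invariant to overall speaking-rate changes.
    """
    t = np.asarray(event_times, dtype=float)
    return (t - t[0]) / (t[-1] - t[0])

def template_distance(stimulus_times, template_times):
    """Largest mismatch in relative timing between stimulus and template
    (assumes equal event counts; illustrative, not the paper's TMC)."""
    return np.abs(relative_timing(stimulus_times)
                  - relative_timing(template_times)).max()

template = [0.00, 0.02, 0.05, 0.11]      # event times of a stored token (s)
slow = [1.4 * t for t in template]       # the same token, 1.4x slower
d = template_distance(slow, template)    # ~0: time scale is factored out
```

A stimulus with genuinely different internal timing (not just a uniform stretch) yields a nonzero distance, which is what lets the circuit still discriminate between word-pair members.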