A model based upon response fields derived during early experience can account for the interference effects of synthetically degraded speech signals

Susan L. Denham, Martin Coath
Centre for Theoretical and Computational Neuroscience
University of Plymouth, United Kingdom
sdenham@plymouth.ac.uk, mcoath@plymouth.ac.uk

Abstract

The structure of spectrotemporal response fields (STRFs) in human auditory cortex is not known, but if they develop through early acoustic experience then it seems reasonable to suppose that speech might play a large part in their formation. We have previously shown that the patterns of activity in a model of auditory processing, in which cortical STRFs are derived from fragments of speech stimuli, convey significant information with respect to stimulus class (Coath, Brader et al. 2004; Coath and Denham 2004). Here we investigate whether such a model can also account for the degree of interference to ongoing speech processing caused by synthetically degraded speech signals (Brungart, Simpson et al. 2005).

1. Introduction

In animals, adult-like response properties in cortex develop through exposure to sounds during an early critical period (Zhang, Bao et al. 2001). The structure of spectrotemporal response fields (STRFs) in human auditory cortex is not known, but if they too develop through early acoustic experience then it seems reasonable to suppose that speech might play a large part in their formation. We investigated this hypothesis by developing a model of auditory processing in which STRFs were derived from fragments of a limited set of utterances (Coath and Denham 2004). We found that the pattern of responses across an ensemble of STRFs supported the classification of novel words and was robust to variability introduced by different speakers, sexes and accents. Furthermore, the ensemble response could be interpreted in qualitatively different ways; for example, from the same response it was also possible to classify the sex and identity of the speaker, and the prosody of the word (Coath, Brader et al. 2004).

The summed response of the ensemble of STRFs clearly indicates the presence of discrete events in an ongoing stream of sounds and essentially acts as an indicator of saliency. We previously showed that the ensemble response resembled cortical phase-locking to the stimulus temporal envelope as measured by Ahissar, Nagarajan et al. (2001), and found that the correlation between the intelligibility of time-compressed speech and the strength of cortical phase-locking was replicated by the model when STRFs with a duration of 100 ms were used (Coath and Denham 2004).

The idea proposed here is that the summed ensemble response might provide a measure of the degree to which any sound engages human cortex. A recent study investigated the degree of interference caused by various forms of degraded speech stimuli to the intelligibility of a target utterance (Brungart, Simpson et al. 2005) and found that the more severe the degradation, the less the interference. However, the degree of interference could not be predicted simply from the intelligibility of the interferer, since time-reversed speech, although unintelligible, was a very effective interferer. It was suggested that ‘speech-like fluctuations in the spectral envelope of a signal’ were important in determining the effectiveness of an interferer (Brungart, Simpson et al. 2005).
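To make this idea concrete, the following is a minimal sketch, in Python, of how a summed ensemble response could be computed. The random kernels stand in for the speech-derived STRFs, and the ensemble size, frame rate and half-wave rectification are illustrative assumptions rather than details of the model:

```python
# Illustrative sketch (not the authors' implementation) of a summed
# STRF-ensemble response.  Random kernels stand in for speech-derived STRFs.
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)

n_channels, n_frames = 30, 500                  # toy spectrotemporal input
cochleagram_in = rng.random((n_channels, n_frames))

# Hypothetical ensemble of 50 STRFs; at an assumed 10 ms frame rate,
# 10 frames corresponds to the 100 ms STRF duration mentioned above.
strfs = [rng.standard_normal((n_channels, 10)) for _ in range(50)]

def ensemble_response(spec, kernels):
    """Half-wave-rectified response of each STRF, summed over the ensemble."""
    total = np.zeros(spec.shape[1])
    for k in kernels:
        # np.flip turns the convolution into a correlation of the kernel
        # with the input; summing over axis 0 collapses frequency.
        r = fftconvolve(spec, np.flip(k), mode='valid').sum(axis=0)
        total[:r.size] += np.maximum(r, 0.0)    # keep only positive matches
    return total

saliency = ensemble_response(cochleagram_in, strfs)
```

Applied to the cochleagram of an interferer, a statistic of `saliency` (its peak or mean, say) could then serve as an estimate of how strongly the sound engages the ensemble.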
One interpretation of this is that interference might be explained by the degree to which the interferer engages, and hence competes for, the same cortical processes as the target speech. We therefore investigated whether the strength of the summed ensemble response in the model might predict the effectiveness of the interferers, using a set of speech stimuli and degraded forms of these stimuli, manipulated as in Brungart, Simpson et al. (2005).

2. Method

2.1. The model

The model, illustrated in figure 1, consists of four principal processing stages: spectral decomposition, extraction of envelope transients, convolution with the STRF ensemble, and measurement of the summed ensemble response.

Spectral decomposition: The raw waveform is processed by a 30-channel Gammatone filterbank model of cochlear processing (Slaney 1998), with centre frequencies ranging from 100 to 8000 Hz, distributed
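As a sketch of this first stage: the excerpt breaks off before the distribution of the centre frequencies is stated, so equal spacing on the ERB-rate scale (the usual convention for such filterbanks) is assumed here, and the fourth-order FIR gammatone approximation below is illustrative rather than Slaney's implementation:

```python
# Sketch of the spectral-decomposition stage: a 30-channel gammatone
# filterbank, 100-8000 Hz.  ERB-rate spacing is an assumption.
import numpy as np
from scipy.signal import fftconvolve

def erb_space(lo, hi, n):
    """n centre frequencies equally spaced on the ERB-rate scale."""
    e = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)   # Hz -> ERB rate
    erbs = np.linspace(e(lo), e(hi), n)
    return (10.0 ** (erbs / 21.4) - 1.0) / 4.37e-3     # ERB rate -> Hz

def gammatone_ir(fc, fs, dur=0.03, order=4, b_factor=1.019):
    """FIR approximation of a fourth-order gammatone impulse response."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37e-3 * fc + 1.0)                  # bandwidth in Hz
    g = (t ** (order - 1) * np.exp(-2 * np.pi * b_factor * erb * t)
         * np.cos(2 * np.pi * fc * t))
    return g / np.abs(g).sum()                         # crude normalisation

def cochleagram(wave, fs, n_channels=30, lo=100.0, hi=8000.0):
    """Filter the waveform through the bank; one row per channel."""
    cfs = erb_space(lo, hi, n_channels)
    return np.stack([fftconvolve(wave, gammatone_ir(fc, fs))[:wave.size]
                     for fc in cfs])
```

Calling `cochleagram(wave, fs)` then yields a 30 × N array, one row per channel, suitable as input to the transient-extraction stage.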
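Tying the stages together, and reusing the two sketches above, the model reduces to a short chain. The envelope-transient stage is not described in this excerpt, so `extract_transients` below is a hypothetical placeholder (a half-wave-rectified temporal derivative per channel):

```python
import numpy as np

def extract_transients(spec):
    """Placeholder transient extraction (an assumption): half-wave-rectified
    temporal derivative of each channel's output."""
    return np.maximum(np.diff(spec, axis=1, prepend=spec[:, :1]), 0.0)

def model_response(wave, fs, strfs):
    """Chain the four stages named above."""
    spec = cochleagram(wave, fs)              # 1. spectral decomposition
    trans = extract_transients(spec)          # 2. envelope transients
    return ensemble_response(trans, strfs)    # 3+4. STRF convolution and
                                              #      summed ensemble response
```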