The effects of temporal asynchrony on the intelligibility of accelerated speech

Douglas S. Brungart, Nandini Iyer, Brian D. Simpson, Virginie van Wassenhove
Air Force Research Laboratory, Wright Patterson AFB, OH
General Dynamics, Dayton, OH
California Institute of Technology, Division of Biology, Pasadena, CA
douglas.brungart@wpafb.af.mil

Abstract

When the audio and visual portions of a speech stimulus are presented synchronously, the resulting enhancement in intelligibility is generally much larger than the one obtained when the audio and visual stimuli are presented sequentially. However, perfect synchronization is not required to obtain a substantial audiovisual (AV) benefit: many studies have shown that AV integration is maximal when the audio signal is slightly delayed relative to the visual signal, and other studies have shown that substantial AV intelligibility enhancement typically extends over a 250+ ms range of AV delay values, from a 50 ms lag in the visual stimulus to a 200 ms lag in the audio stimulus. In this study, artificially accelerated speech stimuli were used to examine the impact that speaking rate has on the characteristics of this temporal integration window. The results indicate that maximal AV enhancement occurs over a progressively narrower range of delay values as the speaking rate increases. The results for the fastest speaking rates also show that peak AV enhancement occurred at a larger AV delay value (150 ms) than has been reported in previous studies. However, there was no conclusive evidence to suggest that the audio delay value for peak AV enhancement systematically changed with the speaking rate of the stimulus.

Index Terms: time compressed speech, AV synchrony

1. Introduction

It is now well known that visual cues obtained by seeing the face of the talker can enhance the detection and the intelligibility of auditory speech [1]. It is also widely known that this visual enhancement of speech perception is much greater when the audio and visual portions of the stimulus are presented simultaneously than when they are presented sequentially. However, the audio and visual portions of the speech signal generally do not need to be perfectly synchronous for AV integration to occur. Numerous psychophysical studies [2, 3, 4, 5] have shown that peak AV enhancement occurs when the audio signal is slightly delayed relative to the visual signal [5, 4, 6], and that AV intelligibility enhancement can occur over a temporal window of integration in AV speech perception that approximates the syllable unit (~100-300 ms [7, 8]), regardless of task or congruency of the tested speech tokens. This window of integration is also asymmetric, with a much greater tolerance for situations where the audio signal lags the visual signal (significant AV integration occurs at audio delays of 200 ms or more) than for situations where the visual signal lags the audio stimulus (little or no integration occurs when the audio signal leads by more than 50 ms) [9].

One aspect of AV speech perception that has not been explored in detail is the impact that speaking rate has on the temporal integration window for AV speech perception. Naturally produced speech can vary over a wide range of speaking rates (from roughly 1-5 syl/s), and studies conducted with artificially accelerated speech have shown that normal listeners can understand speech accelerated by a factor of three or more [10]. It is also clear that speaking rate can influence AV speech perception.
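Artificial acceleration of this kind is typically achieved by time-compressing the audio signal without shifting its pitch. The paper does not specify the compression algorithm it used, so the following is only an illustrative sketch: a naive windowed overlap-add (OLA) compressor, with frame and hop sizes chosen arbitrarily for the example.

```python
import numpy as np

def time_compress(x, factor, frame=1024, hop=256):
    """Naively time-compress signal x by `factor` using windowed
    overlap-add (OLA): frames are read from the input at an analysis
    hop of factor * hop and written to the output at a synthesis hop
    of `hop`, so the result is roughly 1/factor the input length at
    the original pitch. (Illustrative only; real time-scale
    modification usually adds waveform alignment, as in SOLA/WSOLA.)"""
    win = np.hanning(frame)
    a_hop = int(round(hop * factor))              # analysis hop (input side)
    n_frames = max(1, (len(x) - frame) // a_hop + 1)
    out = np.zeros(hop * (n_frames - 1) + frame)
    norm = np.zeros_like(out)                     # window sum, for normalization
    for i in range(n_frames):
        seg = x[i * a_hop : i * a_hop + frame] * win
        out[i * hop : i * hop + frame] += seg
        norm[i * hop : i * hop + frame] += win
    return out / np.maximum(norm, 1e-8)

# Example: compress 1 s of a 440 Hz tone at 16 kHz by a factor of 3.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
fast = time_compress(tone, 3.0)
```

Because each output sample is a window-weighted average of input samples, the compressed signal stays within the input's amplitude range; the simple resampling alternative (playing the waveform faster) would instead raise the pitch by the same factor, which is why OLA-style methods are the usual choice for accelerated-speech stimuli.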
Green and Miller [11] showed that speaking rate was assessed equivalently in visual, auditory, and AV speech presentations with normal faces, but not with inverted faces, suggesting that the speaking rate in visual speech is not only accessible to, but also inherent to, the AV speech integration mechanism. It has also been suggested that speaking rate can be fundamental to the segmental categorization of AV speech [12], and indeed the rate of visual speech affects phonetic categorization [11, 13] (even when the McGurk effect [14] does not take place [13], suggesting that the extent to which auditory and visual information integrate may be underestimated in studies of the McGurk effect). There is also evidence that both McGurk fusion and combination are enhanced during slow speech [4, 15], and that AV integration can occur over a very wide range of speaking rates, including extremely fast speech stimuli. A recent study conducted in our laboratory has also shown that listeners can obtain a substantial AV benefit for speech presented in noise at rates of up to 20 syl/s [16].

While there is ample evidence that speaking rate has a substantial impact on AV speech perception, very little is known about the effect that changes in speaking rate might have on the temporal integration window for AV speech perception. Recent studies using brain imaging techniques with exquisite spatial (e.g., fMRI) and temporal (e.g., electro- and magnetoencephalography) resolution have shown that the neural mechanisms underlying auditory speech processing are tuned to key temporal properties of (auditory) speech [17, 18], such as the syllable and sub-phonetic features. Thus, it is quite conceivable that the audio delay required for peak AV integration, or the width of the temporal integration window for AV speech, might similarly be influenced when the speaking rate is changed.
Increases in speaking rate also change the kinematics of speech production, which might significantly influence the temporal dynamics of AV speech perception. Within the kinematics of the face, different kinds of motion of the surface structures, velocity patterns, and frequency components over a wide spectrum [19] are all likely to vary with speaking rate, and any or all of these could contribute differently to AV speech integration for fast and slow speech. Recently, an analysis-by-synthesis model positing a close relationship between the articulatory internal representations (as distinctive features [20, 21]) in speech production/perception has been ex-

Accepted after peer review of full paper. Copyright 2008 AVISA. 26-29 September 2008, Moreton Island, Australia.