ELECTROPHYSIOLOGY OF UNIMODAL AND AUDIOVISUAL SPEECH PERCEPTION

Lynne E. Bernstein 1, Curtis W. Ponton 2, Edward T. Auer, Jr. 1

1 House Ear Institute, 2100 W. Third St., Los Angeles, CA 90057.
2 Neuroscan Labs, 5700 Cromo Dr. – STE 100, El Paso, TX 79912.

ABSTRACT

Based on behavioral evidence, audiovisual speech perception is generally thought to proceed linearly from initial unimodal perceptual processing to integration of the unimodally processed information. We investigated unimodal versus audiovisual speech processing using electrical event-related potentials (ERPs) obtained from twelve adults. Nonsense syllable stimuli were presented in an oddball paradigm to evoke the mismatch negativity (MMN). Conditions were (1) audiovisual incongruent stimuli (visual /ga/ + auditory /ba/) versus congruent audiovisual stimuli (visual /ba/ + auditory /ba/), (2) visual-only stimuli from the audiovisual condition (/ga/ vs. /ba/), and (3) auditory-only stimuli (/ba/ vs. /da/). A visual-alone MMN was obtained at occipital and temporo-parietal electrodes, and the classical auditory MMN was obtained at the vertex electrode, Cz. Under audiovisual conditions, the negativity recorded at the occipital electrode locations was reduced in amplitude and latency compared to that recorded in the visual-only condition. Also, under the audiovisual condition, the vertex electrode showed a smaller negativity with increased latency relative to the auditory MMN. The neurophysiological evidence did not support a simple bottom-up linear flow from unimodal processing to audiovisual integration.

1. INTRODUCTION

Speech perception is a process that transforms speech signals into the neural representations that are then projected onto word-form representations in the mental lexicon. Phonetic perception is more narrowly defined as the perceptual processing of the linguistically relevant attributes of the physical (measurable) speech signals.
How acoustic and optical phonetic speech signals are processed under unimodal conditions, and how they are integrated under audiovisual conditions, are fundamental questions for the behavioral and brain sciences. The McGurk effect [1] has been used as a tool for investigating these questions behaviorally. An example of the McGurk effect is when a visible spoken token [ga] is presented synchronously with an audible [ba], frequently resulting in the reported percept /da/. Various experimental results have led to the McGurk effect being attributed to, and viewed as evidence for, early bottom-up perceptual integration of phonetic information. For example, selectively attending to only one modality does not abolish the effect [2], suggesting that cognitive top-down control is not possible. Gender incongruency, and knowledge about phonemic incongruency between the auditory and visual stimuli, do not abolish it [3, 4], suggesting a high degree of bottom-up automaticity. Auditory phonetic distinctions, for example voicing, are affected by visual syllables [5], and phonetic goodness judgments are affected by visual syllables [6], suggesting that integration is subphonemic. McGurk percepts do not adapt auditory stimuli [7], suggesting that audiovisual integration strictly follows auditory phonetic processing. These observations are consistent with the theory that a bottom-up flow of unimodal processing precedes audiovisual integration.

There are also results inconsistent with a strictly bottom-up flow of unimodal information followed by integration. For example, audiovisual asynchrony of approximately 180 ms does not abolish McGurk effects [cf. 8, 9], suggesting that perceptual information can be maintained in memory and then integrated. Reductions in effect strength have also been shown to occur with training [2] and with talker familiarity [10], both of which may be related to post-perceptual processes.
THE CURRENT STUDY

We investigated the processing of unimodal auditory, unimodal visual, and audiovisual speech stimuli using recordings of electrical event-related potentials (ERPs) obtained in an oddball, mismatch negativity (MMN) paradigm. ERPs afford neurophysiological measures of brain activity with high temporal resolution (< 1 ms) and moderately good spatial resolution (< 10 mm). These measures of thalamic and cortical brain activity are presumed to reflect mostly excitatory post-synaptic potentials arising from large populations of pyramidal cells oriented in a common direction [12-14]. ERPs are often classified as exogenous (reflecting physical

AVSP 2001 International Conference on Auditory-Visual Speech Processing 104
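As a rough illustration of the stimulus schedule an oddball MMN paradigm relies on, the sketch below generates a pseudo-random sequence of frequent standards and rare deviants, with a minimum run of standards enforced between deviants. This is a minimal sketch of the general technique only; the function name, deviant probability, and spacing constraint are illustrative assumptions, not parameters reported in this study.

```python
import random

def oddball_sequence(n_trials, deviant_prob=0.15, min_standards_between=2, seed=0):
    """Generate a pseudo-random oddball sequence for an MMN paradigm.

    'S' marks a standard stimulus (e.g., the frequent syllable) and 'D' a
    deviant (e.g., the rare syllable). Deviants are kept rare and separated
    by at least `min_standards_between` standards, a common constraint for
    eliciting a clean mismatch negativity. (All parameter values here are
    illustrative assumptions.)
    """
    rng = random.Random(seed)
    seq = []
    since_deviant = min_standards_between  # first trial is deviant-eligible
    for _ in range(n_trials):
        if since_deviant >= min_standards_between and rng.random() < deviant_prob:
            seq.append("D")
            since_deviant = 0
        else:
            seq.append("S")
            since_deviant += 1
    return seq

seq = oddball_sequence(400)
print(seq.count("D") / len(seq))  # realized deviant proportion
```

Fixing the random seed makes the schedule reproducible across recording sessions, and the spacing constraint prevents back-to-back deviants from contaminating the standard-evoked baseline.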