SIMILARITY STRUCTURE IN VISUAL PHONETIC PERCEPTION AND OPTICAL PHONETICS

Lynne E. Bernstein 1, Jintao Jiang 2, Abeer Alwan 2, and Edward T. Auer, Jr. 1

1 House Ear Institute, 2100 W. Third St., Los Angeles, CA 90057
2 Department of Electrical Engineering, University of California at Los Angeles, CA 90095

ABSTRACT

This study examined relationships between the similarity structures of optical phonetic measures and visual phonetic perception. Four talkers who varied in visual intelligibility were recorded simultaneously with a 3-dimensional optical recording system and a video camera. Subjects identified the talkers' consonant-vowel nonsense syllable utterances in a forced-choice identification task. The resulting perceptual confusion matrices were analyzed using multidimensional scaling, yielding Euclidean distances among the stimulus phonemes. Physical Euclidean distances between the same phonemes were computed on the raw 3-dimensional optical recordings. Multilinear regression was used to generate a transformation vector between physical and perceptual distances, and correlations were then computed between the transformed physical and perceptual distances. These correlations ranged between .77 and .81 (59% to 66% of variance accounted for), depending on the vowel context. The relatively raw representations of the physical stimuli were thus effective in accounting for visual speech perception, a result consistent with the hypothesis that perceptual representations and similarity structures for visual speech are modality-specific.

1. INTRODUCTION

A working definition of speech perception is that it is the process by which speech signals are transformed into neural representations that are then projected onto word-form representations in the mental lexicon.
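The analysis pipeline summarized in the abstract (confusion matrices, multidimensional scaling, multilinear regression, correlation) can be sketched end to end in a few lines of numpy. This is a minimal illustration, not the paper's analysis: the confusion counts, the number of optical channels, and the noise model below are all invented stand-ins for the real measurements.

```python
import numpy as np

def classical_mds(dissim, n_dims):
    """Classical (Torgerson) MDS: embed a symmetric dissimilarity
    matrix in n_dims Euclidean dimensions via double centering."""
    n = dissim.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (dissim ** 2) @ J           # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)
    top = np.argsort(evals)[::-1][:n_dims]     # keep the largest eigenvalues
    return evecs[:, top] * np.sqrt(np.clip(evals[top], 0.0, None))

rng = np.random.default_rng(0)
n_phon, n_dims, n_channels = 8, 3, 5

# Toy identification counts (rows: stimulus, cols: response); convert to
# proportions, symmetrize, and treat high confusion as small distance.
counts = rng.integers(1, 50, size=(n_phon, n_phon))
p = counts / counts.sum(axis=1, keepdims=True)
dissim = 1.0 - (p + p.T) / 2
np.fill_diagonal(dissim, 0.0)

# Perceptual distances: pairwise Euclidean distances among MDS coordinates.
coords = classical_mds(dissim, n_dims)
i, j = np.triu_indices(n_phon, k=1)
perc = np.linalg.norm(coords[i] - coords[j], axis=1)

# Toy per-channel physical distances (one column per optical channel),
# built as noisy linear functions of the perceptual distances so the
# regression has a relationship to recover.
phys = (perc[:, None] * rng.random(n_channels)
        + 0.02 * rng.standard_normal((perc.size, n_channels)))

# Multilinear regression: transformation vector mapping the physical
# distance channels onto the perceptual distances.
A = np.column_stack([phys, np.ones(perc.size)])    # intercept column
w, *_ = np.linalg.lstsq(A, perc, rcond=None)

# Correlation between transformed physical and perceptual distances.
r = np.corrcoef(A @ w, perc)[0, 1]
print(f"r = {r:.2f}  (variance accounted for: {r**2:.0%})")
```

With real data, `perc` would come from the confusion matrices of the identification experiment and each column of `phys` from distances computed on the 3-dimensional optical recordings, but the algebra is the same.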
Phonetic perception is more narrowly defined as the perceptual processing of the linguistically relevant attributes of the physical (measurable) speech signals. Understanding phonetic perception requires determining the relationship between physical stimulus attributes and their perceptual (or neural) consequences.

However, visual speech stimuli in perception experiments are very frequently described only in terms of the gender and language of the talker, how the recordings were made, and the linguistic content of the utterances (phonemes, words, sentences, etc.) [1], not in terms of any optical phonetic characteristics. One reason may be that, until recently, speech researchers worked primarily with acoustic stimuli, and speech perception has been viewed as a primarily auditory function. Explanations for audiovisual and visual-only speech perception have appealed to various theoretical mechanisms such as a common amodal metric [2], a common articulatory representation [3], and abstract features [4], apparently obviating any characterization of the optical phonetic signals themselves. An alternative theory, however, is that visual speech perception relies on modality-specific phonetic processing. If so, the relationship between optical speech signals and visual speech perception needs focused attention. One aspect of this relationship could be the perceptually primary processing of overall stimulus similarity [5]. This study investigated the relationship between visual perceptual similarity and physical similarity.

Perceptual similarity. The most frequently noted characteristic of optical phonetic stimuli is that segmental dissimilarity is reduced relative to that obtained under good listening conditions with acoustic phonetic stimuli. Fairly systematic, although far from invariant, clusters of confusions among visual speech segments are regularly observed. For example, [m b p] are highly confused by perceivers.
Such groupings of perceptually similar segments have come to be regarded as perceptual categories [e.g., 4], frequently referred to as visemes. Visemes have also come to be generally regarded as having no internal perceptual structure. We have adopted the term phoneme equivalence class (PEC) as a generalization of the viseme concept, one that covers a range of quantitatively defined similarity relationships.

AVSP 2001 International Conference on Auditory-Visual Speech Processing
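One simple way such equivalence classes might be derived from a confusion matrix is to link every pair of phonemes whose mutual confusion exceeds a threshold and take the connected components as the classes. The sketch below uses hypothetical confusion proportions (not the paper's data) in which [m b p] are mutually confusable, as are [f v]; the threshold value is likewise an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical visual confusion proportions (rows: stimulus, cols: response).
labels = ["m", "b", "p", "f", "v"]
conf = np.array([
    [0.40, 0.30, 0.25, 0.03, 0.02],
    [0.28, 0.42, 0.24, 0.03, 0.03],
    [0.26, 0.27, 0.41, 0.04, 0.02],
    [0.02, 0.03, 0.03, 0.50, 0.42],
    [0.03, 0.02, 0.03, 0.44, 0.48],
])

def equivalence_classes(conf, labels, threshold=0.15):
    """Link phonemes whose symmetrized confusion exceeds `threshold`
    and return the connected components as equivalence classes."""
    sym = (conf + conf.T) / 2
    n = len(labels)
    classes, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:                       # graph traversal from `start`
            k = stack.pop()
            if k in group:
                continue
            group.add(k)
            stack.extend(
                m for m in range(n)
                if m != k and m not in group and sym[k, m] > threshold
            )
        seen |= group
        classes.append(sorted(labels[k] for k in group))
    return classes

print(equivalence_classes(conf, labels))   # prints [['b', 'm', 'p'], ['f', 'v']]
```

A hard threshold yields classic visemes; the PEC notion described above instead admits a family of such groupings, since sweeping the threshold (or clustering the symmetrized confusions hierarchically) exposes graded internal similarity structure rather than flat categories.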