LINKING PRODUCTION AND PERCEPTION THROUGH SPATIAL AND TEMPORAL FILTERING OF VISIBLE SPEECH INFORMATION

Hugo de Paula 1,2, Hani Camille Yehia 2, Douglas Shiller 3, Gregoire Jozan 1, Kevin G. Munhall 3 & Eric Vatikiotis-Bateson 1,4

1 Human Information Sciences Labs, ATR, Japan
2 CEFALA, Dept. of Electronic Engineering, UFMG, Brazil
3 Psychology Dept., Queen's Univ., Kingston, Canada
4 Dept. of Linguistics, University of British Columbia, Canada

ABSTRACT: This work investigates how perceivers extract phonetically relevant visual information from dynamic audiovisual speech. It tests the hypothesis that low-resolution spatial and temporal information is sufficient for speech perception. Audiovisual perception studies were carried out by applying spatial and temporal low-pass filters to video image sequences of Japanese and English sentences recorded at 30 frames/s. In the time domain, the results indicate that the information contained in the frequency band above the average rate of opening and closing the vocal tract (i.e., > 6 Hz) can be removed without significant degradation of audiovisual speech intelligibility. In the space domain, intelligibility is not degraded as long as spatial frequencies below 19 cycles/face are preserved. The tests used Gaussian filters, whose smooth, monotonic attenuation prevents visual artefacts. However, the lack of a flat pass-band and the wide transition band of Gaussian filters made it difficult to analyse accurately the combined effects of temporal and spatial filtering. For this reason, Chebyshev filters, which have sharper attenuation and a flatter pass-band, have been used in the time domain to allow a more precise analysis of how audiovisual speech information is distributed in space and time. A detailed analysis of the frequency content of the video sequences also allows a deeper understanding of audiovisual speech perception.

INTRODUCTION

This paper addresses the problem of measuring the relevance of the frequency content of audiovisual speech in both the space and time domains. The objective is to deepen our understanding of how auditory and visual stimuli are integrated during human speech production and perception. The idea that speech production and perception occur in multiple modalities has been investigated in several studies. For example, Sumby and Pollack (1954) showed that being able to see the speaker's face enhances speech intelligibility, and Summerfield (1987) quantified this enhancement as equivalent to an acoustic SNR gain of 8 to 10 dB. Recognizing the bi-modality of speech perception and understanding it, however, are not the same. This point was investigated by Beskow, Dahlquist, Granström, Lundeberg, Spens and Öhman (1997), who analysed the phonetic relevance of visible orofacial events during speech-reading by normal and hearing-impaired perceivers. From the speech production point of view, Yehia, Rubin and Vatikiotis-Bateson (1998) demonstrated that speech information is distributed over the entire face, rather than concentrated on the lip region. Later, Yehia, Kuratate and Vatikiotis-Bateson (2002) showed that time-varying attributes of the vocal tract, face and head position are strongly correlated with spectral features of the speech acoustics.
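To make the filtering manipulation described in the abstract concrete, the following is a minimal sketch of Gaussian spatiotemporal low-pass filtering of a grayscale video, written in Python with NumPy/SciPy. It is not the authors' implementation: the function name, the use of scipy.ndimage.gaussian_filter, and the approximation of one face height by the frame height are all assumptions made here for illustration. The cutoff-to-sigma conversion follows from the Gaussian magnitude response exp(-2 pi^2 sigma^2 f^2).

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def gaussian_lowpass_video(frames, fps=30.0, ft_hz=6.0, fs_cpf=19.0):
        """Spatiotemporal Gaussian low-pass of a grayscale video (sketch).

        frames : ndarray of shape (T, H, W)
        fps    : frame rate (30 frames/s, as in the stimuli)
        ft_hz  : temporal half-power cutoff in Hz
        fs_cpf : spatial half-power cutoff in cycles/face; the face is
                 assumed here to span the full frame height H
        """
        T, H, W = frames.shape
        # A Gaussian kernel with standard deviation sigma has magnitude
        # response exp(-2 pi^2 sigma^2 f^2); solving for the -3 dB point
        # gives sigma = sqrt(ln 2) / (2 pi f_c), with f_c in cycles/sample.
        k = np.sqrt(np.log(2.0)) / (2.0 * np.pi)
        sigma_t = k / (ft_hz / fps)   # std in frames (time axis)
        sigma_s = k / (fs_cpf / H)    # std in pixels (both spatial axes)
        return gaussian_filter(frames.astype(float),
                               sigma=(sigma_t, sigma_s, sigma_s))

The same response exp(-2 pi^2 sigma^2 f^2) is also what the abstract criticizes: it has no flat pass-band and decays over a wide transition band, so the attenuation at any particular frequency is gradual rather than switch-like.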
Implicit in the high degree of dependence between the articulatory and acoustic domains are important time-varying properties. Specifically, the rate of speech production corresponds approximately to the average rate of opening and closing the vocal tract, which is less than 10 Hz (Munhall and Vatikiotis-Bateson, 1998). From the perception point of view, Vatikiotis-Bateson, Eigsti, Yano and Munhall (1998) showed that the eyes attract a significant amount of the listener's attention even when the audio is severely degraded. This can be taken as evidence that the temporal resolution necessary to acquire visual speech information is well below the limits of human vision. Further evidence for the importance of visual information in speech communication is given by Benoît and LeGoff (1998), who showed that speech intelligibility is improved by adding synthesized visible features, such as naturalistic lips, a skeletal jaw, or a parametric face, to the speech acoustics.
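Because the production rate stays below roughly 10 Hz, a 6 Hz temporal cutoff with a sharp transition is a natural test point. The sketch below, again in Python with SciPy, shows one way to impose such a Chebyshev temporal low-pass on a 30 frames/s sequence; the type-I design, the filter order, the ripple value, and the zero-phase application via filtfilt are all assumptions for illustration, since the text does not specify the design actually used.

    import numpy as np
    from scipy.signal import cheby1, filtfilt

    def chebyshev_temporal_lowpass(frames, fps=30.0, fc_hz=6.0,
                                   order=6, ripple_db=0.5):
        """Low-pass each pixel's time series with a Chebyshev filter (sketch).

        frames    : ndarray of shape (T, H, W)
        fc_hz     : cutoff in Hz (6 Hz, near the average rate of opening
                    and closing the vocal tract cited above)
        order     : filter order (assumed; not given in the text)
        ripple_db : pass-band ripple in dB (assumed)
        """
        # cheby1 expects the cutoff normalized to the Nyquist frequency.
        b, a = cheby1(order, ripple_db, fc_hz / (fps / 2.0))
        # filtfilt runs the filter forward and backward, giving a
        # zero-phase result so the filtered frames remain time-aligned
        # with the accompanying speech audio.
        return filtfilt(b, a, frames.astype(float), axis=0)

Zero-phase filtering matters in this setting because any group delay introduced into the video would desynchronize the visual stimulus from the audio it accompanies.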