Audiovisual Perception of Contrastive Focus in French

Marion Dohen, Hélène Lœvenbruck, Marie-Agnès Cathiard & Jean-Luc Schwartz
Institut de la Communication Parlée, UMR CNRS 5009, INPG, Univ. Stendhal, Grenoble, France
{dohen; loeven}@icp.inpg.fr

Abstract

The purpose of this study is to determine whether the visual modality is useful for the perception of prosody. An audiovisual corpus was recorded from a male native French speaker. The sentences had a subject-verb-object (SVO) syntactic structure. Four contrastive focus conditions were studied: focus on each phrase (S, V or O) and no focus. Normal and reiterant modes were recorded. We first measured fundamental frequency (F0), duration and intensity to validate the corpus. Then, lip aperture and jaw opening were extracted from the video data. The articulatory analysis enabled us to suggest a set of possible visual cues to focus. These cues are a) large jaw opening gestures and high opening velocities on all the syllables of the focused phrase; b) long initial lip closure and c) hypo-articulation (reduced jaw opening and duration) of the following phrases. We then developed a perception test to determine whether subjects could perceive focus through the visual modality alone. It showed that a) contrastive focus was well perceived visually for reiterant speech; b) no training was necessary and c) subject focus was slightly easier to identify than the other focus conditions. We also found that the presence and salience of the visual cues enhance perception.

1. Introduction

1.1. Prosody as multigestural and multimodal

Prosody is crucial in speech communication. It is involved in the extraction of information such as the sentence structure, the type of speech act, or the speaker’s emotional state. It is mainly conceived of as a set of glottal and subglottal patterns resulting in variable acoustic parameters such as F0, intensity and duration.
Therefore, perceptual studies of prosody mostly deal with the auditory modality [1-6]. On the visual side, glottal and subglottal gestures per se are essentially invisible, although they might have facial movement correlates [7]. The recent data by Burnham on the visual perception of tonal contrasts are a spectacular illustration of the fact that F0 could be directly perceived by the eye [8]. It is also shown in [9] that eyebrow movements are visual cues to the perception of focus. In fact, prosody is multigestural and involves subglottal, glottal and supraglottal correlates. It displays a rich set of facial movements. It should therefore be conceived of as multimodal.

Although most studies of French prosody have focused on laryngeal and pulmonic correlates, a few supralaryngeal studies have been carried out (e.g. [10-12]). A number of possible jaw, tongue and lip correlates of prosodic patterns were suggested. These should have visible consequences. In this paper we describe the relationship between tonal and visual characteristics of contrastive focus in French. We then present a perception test to see whether focus is perceived visually. The purpose is to examine the visual prosodic cues to the perception of contrastive focus.

1.2. Background

[Figure 1: two panels, each plotting Pitch (Hz), 70 to 180, against Time (s) for the syllables "Ro main ra ni ma la jo lie ma man". Tonal labels in panel a: L H* L Hi L H* L Hi L%. Tonal labels in panel b: L H* L Hf L%.]

Figure 1: spectrogram and F0 trace for an IP including 3 APs. a. (top) unfocused case. b. (bottom) focus on the verb AP. The utterance was {[Romain]AP [ranima]AP [la jolie maman]AP}IP (Romain revived the pretty mother.).

Many phonological models of the prosodic structure of French have been proposed [1,13-18]. Jun & Fougeron’s model [18, 19] was used in the present study.
It agrees with most descriptions of French intonation and uses a transcription system consistent with the widely used ToBI [20]. It features two hierarchical prosodic units: the lowest is the Accentual Phrase (AP) and the highest is the Intonational Phrase (IP). The AP contains one or more content words and is right-demarcated by the primary stress (H*). An initial LHi (Low-High) tonal sequence, also called the initial or secondary accent, can mark the initial boundary of an AP. The default tonal pattern of the AP is /LHiLH*/, as realized on the second AP of Figure 1a. The IP level can preempt the AP level. For example, if an AP is IP-final, H* is replaced by the boundary tone of the IP (L% or H%), as shown in the last AP of Figure 1a.

In this model, contrastive focus is considered to be marked by a strong Hf and by a low plateau on the subsequent syllables. Hf most often replaces Hi (Figure 1b), but it can also replace both Hi and H* (i.e. the rise in F0 is carried by all the syllables in the phrase and culminates on the last syllable).

AVSP 2003 - International Conference on Audio-Visual Speech Processing, St. Jorioz, France, September 4-7, 2003. ISCA Archive, http://www.isca-speech.org/archive
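The tonal rules of the model summarized above can be illustrated with a minimal Python sketch. It is not part of the original study: the function name `tonal_sketch`, its parameters, and the list-of-labels representation are hypothetical. It covers only the most common focus case mentioned in the text (Hf replacing Hi), and it assumes that the low plateau after Hf carries no tone labels of its own, so that post-focal material contributes only the final boundary tone, as in the labels of Figure 1b. The optional nature of the initial accent is ignored, so every unfocused AP receives the full default pattern.

```python
from typing import List, Optional

# Default AP pattern /LHiLH*/ as given in the text.
DEFAULT_AP = ["L", "Hi", "L", "H*"]


def tonal_sketch(n_aps: int, focus: Optional[int] = None,
                 boundary: str = "L%") -> List[List[str]]:
    """Return one list of tone labels per AP of a single IP.

    Hypothetical helper for illustration only.
    n_aps    -- number of APs in the IP
    focus    -- 0-based index of the focused AP, or None for no focus
    boundary -- IP boundary tone, "L%" or "H%"
    """
    if n_aps < 1:
        raise ValueError("an IP needs at least one AP")
    if boundary not in ("L%", "H%"):
        raise ValueError("boundary tone must be 'L%' or 'H%'")
    if focus is not None and not 0 <= focus < n_aps:
        raise ValueError("focus index out of range")

    phrases = []
    for i in range(n_aps):
        if focus is None or i < focus:
            # Unfocused or pre-focal AP: default pattern.
            tones = list(DEFAULT_AP)
        elif i == focus:
            # Focused AP: Hf replaces Hi (most common case in the text).
            # Assumption: what follows Hf is the unlabeled low plateau.
            tones = ["L", "Hf"]
        else:
            # Post-focal AP: assumed to lie entirely on the low plateau.
            tones = []
        if i == n_aps - 1:
            # IP-final AP: H* is replaced by the boundary tone; if H* is
            # absent (focused or post-focal AP), the boundary tone is added.
            if tones and tones[-1] == "H*":
                tones[-1] = boundary
            else:
                tones.append(boundary)
        phrases.append(tones)
    return phrases


if __name__ == "__main__":
    # Three APs as in Figure 1: [Romain] [ranima] [la jolie maman].
    print(tonal_sketch(3))           # unfocused rendition
    print(tonal_sketch(3, focus=1))  # focus on the verb AP
```

The second call yields the labels L Hf for the verb AP followed only by L% on the last AP, which is one possible reading of Figure 1b. The case where Hf replaces both Hi and H* is deliberately left out of the sketch.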