It’s Not What You Say but the Way You Say It: Matching Faces and Voices

Karen Lander, University of Manchester
Harold Hill, Advanced Telecommunications Research Institute International
Miyuki Kamachi, Kogakuin University
Eric Vatikiotis-Bateson, University of British Columbia

Recent studies have shown that the face and voice of an unfamiliar person can be matched for identity. Here the authors compare the relative effects of changing sentence content (what is said) and sentence manner (how it is said) on matching identity between faces and voices. A change between speaking a sentence as a statement and as a question disrupted matching performance, whereas changing the sentence itself did not. This was the case when the faces and voices were from the same race as participants and speaking a familiar language (English; Experiment 1) or from another race and speaking an unfamiliar language (Japanese; Experiment 2). Altering manner between conversational and clear speech (Experiment 3) or between conversational and casual speech (Experiment 4) was also disruptive. However, artificially slowing (Experiment 5) or speeding (Experiment 6) speech did not affect cross-modal matching performance. The results show that bimodal cues to identity are closely linked to manner but that content (what is said) and absolute tempo are not critical. Instead, prosodic variations in rhythmic structure and/or expressiveness may provide a bimodal, dynamic identity signature.

Keywords: face identity, voice identity, face and voice matching

When people speak, the mechanics of speech production determine not only the sound of the voice but also the movement of the face (Vatikiotis-Bateson, Munhall, Hirayama, Lee, & Terzopoulos, 1996; Yehia, Rubin, & Vatikiotis-Bateson, 1998). Evidence that humans are sensitive to, and take advantage of, this linkage comes from the important roles visual information from the face plays in speech perception. Examples include the perception of speech in noise (Sumby & Pollack, 1954), the McGurk effect (McGurk & MacDonald, 1976), silent speechreading (Bernstein, Demorest, & Tucker, 1998), and emotion perception (De Gelder & Vroomen, 2000), including the ability to hear expressions (Auberge & Cathiard, 2003; Tartter, 1994). In this article, we focus on the dynamic information shared by faces and voices that provides cues to identity, even across a change in modality (Kamachi, Hill, Lander, & Vatikiotis-Bateson, 2003; Lachs & Pisoni, 2004; Rosenblum, Smith, Nichols, Lee, & Hale, 2006).

It is, of course, possible to recognize the identity of a familiar person from either a static face or a voice alone. However, for faces, recent evidence suggests that movement can act as an additional visual cue to face identity (for a review, see O’Toole, Roark, & Abdi, 2002). As much of this movement is associated with speech, it may be closely linked to identity information available from the voice, including, for example, its temporal patterning (Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004; Remez, Fellowes, & Rubin, 1997), and may support identity matching across modalities (Kamachi et al., 2003; Lachs & Pisoni, 2004; Remez et al., 1997). In this article, we investigate the nature of these cues to identity present in both the silent movement of the face and the sound of the voice.
Previous experiments investigating whether participants can match identity from a silently moving face to a voice (a cross-modal matching task), or vice versa, have shown that matching is possible when the same word or sentence is used for both learning and testing. For example, Lachs and Pisoni (2004) showed that the auditory and visual components of a production of the word “cat” can be matched when played forward but not when the stimuli are played backward. This suggests that the normal temporal sequence of speech is important and that people do not rely solely on local cues for matching faces and voices. This was confirmed with face movement depicted as point-light stimuli by Rosenblum et al. (2006), who demonstrated identity matching across different repetitions of the same sentence. The use of point-light stimuli confirms the importance of time-varying over spatial information for this task. Work

Karen Lander, School of Psychological Sciences, University of Manchester, Manchester, England; Harold Hill, Human Information Science Laboratories, Advanced Telecommunications Research Institute International, Kyoto, Japan; Miyuki Kamachi, Faculty of Informatics, Department of Information Design, Kogakuin University, Tokyo, Japan; Eric Vatikiotis-Bateson, Department of Linguistics, University of British Columbia, Vancouver, British Columbia, Canada.

This research was supported in part by the Telecommunications Advancement Organization of Japan. Thanks go to Lewis Chuang, Michelle Campbell, Rebecca Davies, Katherine Easton, and Emma Young for setting up and running the experiments. Thanks also go to Alice O’Toole and Vivien Tartter for useful comments on an earlier version of this article.

Correspondence concerning this article should be addressed to Karen Lander, School of Psychological Sciences, University of Manchester, Oxford Road, Manchester M13 9PL, United Kingdom. E-mail: karen.lander@manchester.ac.uk