On the importance of gaze and speech alignment for efficient communication

Maria Staudte¹, Alexis Heloir², Matthew Crocker¹ and Michael Kipp²

¹ Department of Computational Linguistics, Saarland University, Germany
{masta,crocker}@coli.uni-saarland.de
² DFKI, Embodied Agents Research Group, Germany
firstname.surname@dfki.de

Keywords: referential gaze, spoken interaction, virtual speaker, alignment

Gaze as Visual Reference

Gaze is known to be an important social cue in face-to-face communication, indicating a speaker's focus of attention. Speaker gaze can influence object perception and situated utterance comprehension by directing both interlocutors' visual attention towards the same object, thereby facilitating grounding and disambiguation [1]. The precise temporal and causal processes involved in on-line gaze-following during concurrent utterance comprehension are, however, still largely unknown. In particular, the alignment of referential gaze and speech cues may be essential to this benefit.

In this paper, we report findings from an eye-tracking study that uses a virtual character [2] to systematically assess how speaker gaze influences listeners' on-line comprehension. First, we provide supporting evidence for the hypothesis that artificial characters in general, and our character in particular, can serve as a valuable tool for studying how listeners integrate real-time gaze and speech. Second, our findings point to a clear benefit of speaker gaze for listeners when gaze cues and verbal references occur in identical order. In contrast, incongruent gaze is shown to have a disruptive effect on comprehension.

Fig. 1. Sequence of congruent multimodal references as performed by the agent.