Voice activity detection from gaze in video-mediated communication

Michal Hradis (Brno University of Technology, ihradis@fit.vutbr.cz)
Shahram Eivazi (University of Eastern Finland, seivazi@cs.joensuu.fi)
Roman Bednarik (University of Eastern Finland, roman.bednarik@uef.fi)

Abstract

This paper discusses estimation of the active speaker in multi-party video-mediated communication from the gaze data of one of the participants. In the explored settings, we predict the voice activity of participants in one room based on gaze recordings of a single participant in another room. The two rooms were connected by high-definition, low-delay audio and video links, and the participants engaged in different activities ranging from casual discussion to simple problem-solving games. We treat the task as a classification problem. We evaluate several types of features and parameter settings within a Support Vector Machine (SVM) classification framework. The results show that, using the proposed approach, the vocal activity of a speaker can be correctly predicted 89% of the time for which gaze data are available.

CR Categories: I.2.m [Artificial Intelligence]: Miscellaneous

Keywords: gaze tracking, voice activity detection, machine learning, Support Vector Machines, video-mediated communication

1 Introduction

Eye gaze is central for grounding during communication in that gaze signals are important for collecting and providing information for mutual understanding [Clark and Brennan 1991]. While it is well established that eye movements are a good proxy for the allocation of attention [Rayner 1998], during conversation eye movements also carry information about how well the interlocutors understand each other [Richardson et al. 2007]. Without eye contact, it is hard to engage in an efficient conversation [Argyle and Cook 1976].

In systems supporting multi-party video-mediated (MPVM) communication, a principal problem is presenting information from a remote location on a limited visualisation device. This challenge has to be solved by composing the information in the available screen-space and time in an appropriate way. The systems have to be able to present the remote information compactly on the screen [Jansen et al. 2011], which would ideally be achieved by making automatic directorial decisions in real time [Falelakis et al. 2011; Ursu et al. 2011] based on inferred information about the current activity of the participants and the current interaction state.

Clearly, such systems are not available at the moment, for several reasons. One reason is that for the directorial decisions to intelligently and effectively aid communication, diverse knowledge from several disciplines has to be combined. The involved areas of research include, for example, sensor-based systems, computer vision, machine learning, social psychology, and cinematics. It turns out that one of the required aspects is a deep understanding of attention during collaboration and communication.

In multi-party interaction, certain information can be discarded without negative influence on the understandability and naturalness of the conversation, while omitting other information can make the interaction incomprehensible or frustrating [Ursu et al. 2011].
In this paper, we aim to broaden our understanding of attention in multi-party video-mediated communication by exploring the link between gaze and speech. We investigate the hypothesis that the voice activity of participants in multi-party mediated communication can be estimated from the gaze of a listener. The explored task is to estimate the voice activity – who is speaking and when [1] – of several participants simultaneously located in a single room. We carry out this analysis based only on gaze information recorded from a single remote participant.

We designed a study in which a group of participants had a conversation with another participant, remotely connected by a high-definition, low-latency audio and video link. Such a setup is common, for example, in business meetings, remote assistance, or online lecturing. The participants had known each other prior to the recordings, and the recorded activities range from natural discussion of casual events to simple problem-solving games. Although the presented task is interesting in itself and could find applications in real-time communication, we hope that this study also presents valuable insight into the attention of participants in MPVM interaction.

The approach we chose is based on learning discriminative Support Vector Machine (SVM) classifiers which estimate the voice activity of a participant based on a feature vector extracted from a fixed time-window of gaze data (see the sketch at the end of this section). We present several types of features, their results on the dataset, and an analysis of the recorded gaze data.

[1] Voice activity is understood as any verbal and nonverbal vocal activity.

1.1 Gaze in multi-party communication

Understanding speaker activity during MPVM communication has important implications for designing any system that can proactively coordinate or structure communication. There are various nonverbal cues from which speaker activity can be detected implicitly, for example, the speaker's head position, gaze, facial expressions, and gestures. Speakers effectively use gestures, facial expressions, and body posture to coordinate their communicative activity in conversations [Jokinen 2009; Jokinen et al. 2010]. [Rienks et al. 2010] show how people recognize the speaker among listeners using patterns of head orientation. However, their speaker identification accuracy was only 43.27% on average, which suggests that head orientation alone is not sufficient for predicting speakers in multi-party settings.

While a speaking interlocutor is likely to attract the attention of the listeners in some way, little is known about the details of this process. [Griffin and Bock 2000] explored the time course between fixation and spoken word; their observations show that speakers' fixations precede the corresponding spoken words.
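To make the classification pipeline from the Introduction concrete, the sketch below shows windowed feature extraction and SVM training, assuming Python with numpy and scikit-learn. The specific features (the fraction of the window spent gazing at each participant's screen region and the number of gaze shifts within the window) are illustrative placeholders rather than the paper's actual feature set, which is presented later; the function and variable names are likewise hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def window_features(gaze_regions, n_participants):
    """Feature vector for one fixed time-window of gaze samples.

    gaze_regions: 1-D int array; gaze_regions[t] is the index of the
    participant's screen region gazed at in sample t (-1 = elsewhere).
    """
    on_region = gaze_regions[gaze_regions >= 0]
    counts = np.bincount(on_region, minlength=n_participants)
    time_share = counts / max(len(gaze_regions), 1)     # fraction of window per region
    n_shifts = np.count_nonzero(np.diff(gaze_regions))  # gaze transitions in window
    return np.concatenate([time_share, [n_shifts]])

def train_voice_activity_svm(windows, labels, n_participants):
    """Fit a binary SVM: is the target participant vocally active in a window?

    windows: iterable of 1-D gaze-region arrays (fixed-length time-windows).
    labels:  binary voice-activity label of the target participant per window.
    """
    X = np.array([window_features(w, n_participants) for w in windows])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X, np.asarray(labels))
    return clf
```

Under this reading of the approach, one such binary classifier would be trained per participant in the remote room, and the per-window predictions, concatenated over time, yield the who-is-speaking-and-when estimate.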