Modeling Focus of Attention for Meeting Indexing Based on Multiple Cues

Rainer Stiefelhagen, Jie Yang, Member, IEEE, and Alex Waibel, Member, IEEE

Abstract—A user's focus of attention plays an important role in human–computer interaction applications, such as ubiquitous computing environments and intelligent spaces, where the user's goal and intent have to be continuously monitored. In this paper, we are interested in modeling people's focus of attention in a meeting situation. We propose to model participants' focus of attention from multiple cues. We have developed a system to estimate participants' focus of attention from gaze directions and sound sources. We employ an omnidirectional camera to simultaneously track participants' faces around a meeting table and use neural networks to estimate their head poses. In addition, we use microphones to detect who is speaking. The system predicts participants' focus of attention from acoustic and visual information separately. The system then combines the output of the audio- and video-based focus of attention predictors. We have evaluated the system using the data from three recorded meetings. The acoustic information provided an 8% relative error reduction on average compared to using only one modality. The focus of attention model can be used as an index for a multimedia meeting record. It can also be used for analyzing a meeting.

Index Terms—Focus of attention, head pose estimation, human–computer interaction, meeting indexing, multimedia meeting record, multimodality.

Manuscript received April 12, 2001; revised October 29, 2001. This work was supported in part by the Defense Advanced Research Projects Agency under Contract DAAD17-99-C-0061 and by the National Science Foundation under Grant IIS-9980013.
R. Stiefelhagen is with the Institute for Logic, Complexity and Deduction Systems, University of Karlsruhe, Karlsruhe, Germany (e-mail: stiefel@ira.uka.de).
J. Yang and A. Waibel are with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA.
Publisher Item Identifier S 1045-9227(02)04429-6.

I. INTRODUCTION

A person's focus of attention can be visually identified in certain circumstances. Participants in a meeting, for example, might look at the speaker while they are listening to the talk. When a user is editing a paper, he/she would direct his/her visual attention toward the computer screen. Modeling and tracking a person's focus of attention is useful for many applications: intelligent supportive computer applications could use information about a user's focus of attention to infer the user's mental state, goals, and cognitive load, and adjust their own responses to the user accordingly. For multimodal human–computer interaction, the user's focus of attention can be used to determine the target of his/her message. For example, in interactive intelligent rooms or houses [1], [2], focus of attention could be used to determine whether the user wants to control the refrigerator or the TV set, or whether he/she is talking to another person in the room. In other words, the user's attention focus can be used to guide the environment's "focus" to the right application and to prevent responses from applications that have not been addressed. During social interaction, gaze serves several functions that are not easily conveyed by auditory cues alone [3].
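For concreteness, the following minimal sketch illustrates one plausible way the audio- and video-based predictions mentioned in the abstract could be fused: each modality yields a probability distribution over candidate focus targets, and the two distributions are combined by linear interpolation. This is an illustrative assumption, not necessarily the fusion rule used by the system described in this paper; the weight ALPHA, the function name, and all example values are hypothetical.

```python
# Illustrative sketch (not the paper's actual fusion rule): fuse
# audio-based and video-based focus-of-attention estimates by linear
# interpolation of per-target probability distributions.

ALPHA = 0.6  # assumed weight of the video-based (head pose) predictor


def combine_focus_estimates(p_video, p_audio, alpha=ALPHA):
    """Fuse two distributions over focus targets.

    p_video, p_audio: dicts mapping target name -> probability.
    Returns a normalized dict over the union of targets.
    """
    targets = set(p_video) | set(p_audio)
    fused = {t: alpha * p_video.get(t, 0.0) + (1.0 - alpha) * p_audio.get(t, 0.0)
             for t in targets}
    total = sum(fused.values()) or 1.0  # guard against an all-zero input
    return {t: p / total for t, p in fused.items()}


# Hypothetical example: head pose suggests the participant is looking
# at Mary, while the microphones indicate that John is speaking.
p_video = {"John": 0.2, "Mary": 0.7, "whiteboard": 0.1}
p_audio = {"John": 0.8, "Mary": 0.1, "whiteboard": 0.1}

fused = combine_focus_estimates(p_video, p_audio)
print(max(fused, key=fused.get), fused)  # most likely focus target
```

Under these assumed numbers, the fused estimate still favors "Mary" but with reduced confidence, reflecting the conflicting acoustic cue; any other combination scheme (e.g., a product rule) could be substituted in the same interface.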
In computer-mediated communication systems, such as virtual collaborative workspaces, detecting and conveying participants' gaze has several advantages: it can help the participants determine who is talking or listening to whom, it can serve to establish joint attention during cooperative work, and it can facilitate turn taking among participants [4], [5].

In this paper, we are interested in modeling people's focus of attention in a meeting situation. We are interested in meetings because they are one of the most common, important, and universally disliked events in our lives. Most people find it impossible to attend all relevant meetings or to retain all the salient points raised in the meetings they do attend. Meeting records are intended to overcome these problems and extend human memory. Hand-recorded notes, however, have many drawbacks. Note-taking is time consuming, requires focus, and thus reduces one's attention to, and participation in, the ensuing discussion. For this reason, notes tend to be fragmentary and partially summarized, leaving one unsure exactly what was resolved, and why.

At the Interactive Systems Lab of Carnegie Mellon University, we are developing a multimedia meeting recorder and browser to track and summarize discussions held in a specially equipped conference room [6]. The objective of the project is to provide a multimedia meeting record without using constraining devices such as headsets, helmets, suits, and buttons. The research issues include identifying: 1) who or what is the source of the message; 2) who or what is the target and object of the message (focus of attention); and 3) what the content of the message is in the presence of jamming noise. The main components of the Meeting Browser are a speech recognizer, a summarization module, a discourse component that attempts to identify the speech acts, a module for audio–visual identification of participants [7], and a module for tracking the participants' focus of attention.

In order to quickly retrieve information from such a multimedia meeting record, we can use various indexing methods. It is well known that visual communication cues, such as gesturing, looking at each other, or monitoring each other's facial expressions, play an important role during face-to-face communication [3], [8]. Therefore, to fully understand an ongoing conversation, it is necessary to capture and analyze these visual cues in addition to the spoken content. Once such visual cues can be tracked, they can be used to index and retrieve recorded meetings. Queries, such as "show me all parts of the meeting, where John was telling Mary something about