Modeling Focus of Attention for Meeting Indexing
Based on Multiple Cues
Rainer Stiefelhagen, Jie Yang, Member, IEEE, and Alex Waibel, Member, IEEE
Abstract—A user’s focus of attention plays an important role in
human–computer interaction applications, such as ubiquitous computing
environments and intelligent spaces, where the user’s goals and intent
have to be monitored continuously. In this paper,
we are interested in modeling people’s focus of attention in a
meeting situation. We propose to model participants’ focus of
attention from multiple cues. We have developed a system to
estimate participants’ focus of attention from gaze directions
and sound sources. We employ an omnidirectional camera to
simultaneously track participants’ faces around a meeting table
and use neural networks to estimate their head poses. In addition,
we use microphones to detect who is speaking. The system predicts
participants’ focus of attention from acoustic and visual information
separately. The system then combines the output of the audio-
and video-based focus of attention predictors. We have evaluated
the system using data from three recorded meetings. Using the acoustic
information provided an 8% relative error reduction on average compared
with using only one modality. The focus of attention model can be used
as an index for a multimedia meeting record; it can also be used to
analyze a meeting.
Index Terms—Focus of attention, head pose estimation,
human–computer interaction, meeting indexing, multimedia
meeting record, multimodality.
I. INTRODUCTION
A person’s focus of attention can be visually identified
in certain circumstances. Participants in a meeting, for
example, might look at the speaker while they are listening to
the talk. When a user is editing a paper, he/she would direct
his/her visual attention toward a computer screen.
Modeling and tracking a person’s focus of attention is useful for
many applications. Intelligent supportive computer applications
could use information about a user’s focus of attention to infer the
user’s mental state, his/her goals, and cognitive load, and adjust
their own responses to the user accordingly. For multimodal
human–computer interaction, the user’s focus of attention can
be used to determine his/her message target. For example, in
interactive intelligent rooms or houses [1], [2], focus of attention
could be used to determine whether the user intends to control the
refrigerator or the TV set, or whether he/she is talking to another person in
the room. In other words, the user’s attention focus can be used
to guide the environment’s “focus” to the right application and
to prevent responses generated by applications that have not
been addressed. During social interaction, gaze serves several
functions that are not easily transmitted by auditory cues
alone [3]. In computer-mediated communication systems, such
as virtual collaborative workspaces, detecting and conveying
participants’ gaze has several advantages: it can help the
participants determine who is talking or listening to whom, it
can serve to establish joint attention during cooperative work,
and it can facilitate turn taking among participants [4], [5].

Manuscript received April 12, 2001; revised October 29, 2001. This work was
supported in part by the Defense Advanced Research Projects Agency under
Contract DAAD17-99-C-0061, and by the National Science Foundation under
Grant IIS-9980013.
R. Stiefelhagen is with the Institute for Logic, Complexity and Deduction
Systems, University of Karlsruhe, Germany (e-mail: stiefel@ira.uka.de).
J. Yang and A. Waibel are with the School of Computer Science, Carnegie
Mellon University, Pittsburgh, PA 15213 USA.
Publisher Item Identifier S 1045-9227(02)04429-6.
In this paper, we are interested in modeling people’s focus
of attention in a meeting situation.
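To make the multimodal estimation outlined in the abstract more concrete, the following minimal sketch illustrates one plausible way to merge a video-based estimate (derived from head pose) and an audio-based estimate (derived from who is speaking) of a participant’s focus of attention. The target names, probability values, and the simple linear weighting are illustrative assumptions only; they are not the combination scheme evaluated later in this paper.

# Illustrative sketch only: combining audio- and video-based
# focus-of-attention estimates for one participant.
def combine_focus_estimates(p_video, p_audio, video_weight=0.5):
    """Linearly interpolate two probability distributions over focus targets."""
    targets = set(p_video) | set(p_audio)
    combined = {t: video_weight * p_video.get(t, 0.0)
                   + (1.0 - video_weight) * p_audio.get(t, 0.0)
                for t in targets}
    total = sum(combined.values())      # renormalize to a proper distribution
    return {t: p / total for t, p in combined.items()}

# Assumed example values: video-based estimate from head pose,
# audio-based estimate from the detected sound source.
p_video = {"personB": 0.6, "personC": 0.3, "personD": 0.1}
p_audio = {"personB": 0.8, "personC": 0.1, "personD": 0.1}

focus = combine_focus_estimates(p_video, p_audio)
print(max(focus, key=focus.get))        # most likely focus target: personB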
We are interested in meetings because they are one of the most
common, important, and universally disliked events in our lives.
Most people find it impossible to attend all relevant meetings
or to retain all the salient points raised in meetings they do attend.
Meeting records are intended to overcome these problems
and extend human memories. Hand-recorded notes, however,
have many drawbacks. Note-taking is time consuming, requires
focus, and thus reduces one’s attention to and participation in
the ensuing discussions. For this reason, notes tend to be fragmentary
and partially summarized, leaving one unsure exactly
as to what was resolved, and why. At the Interactive Systems
Lab of Carnegie Mellon University, we are developing a multimedia
meeting recorder and browser to track and summarize
discussions held in a specially equipped conference room [6].
The objective of the project is to provide a multimedia meeting
record without using constraining devices such as headsets, helmets,
suits, and buttons. The research issues include identifying:
1) who/what is the source of the message; 2) who or what is the
target and object of the message (focus of attention); 3) what
is the content of the message in the presence of jamming noise.
The main components of the Meeting Browser are a speech recognizer,
a summarization module, a discourse component that
attempts to identify speech acts, a module for audio–visual
identification of participants [7], and a module for tracking the
participants’ focus of attention.
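Taken together, the outputs of these modules can be organized into a time-indexed meeting record. The following sketch is a hypothetical illustration of such a record and of a retrieval query of the kind discussed below; the field names and example data are assumptions for illustration, not the actual data structures of the Meeting Browser.

# Hypothetical per-segment record built from the module outputs.
from dataclasses import dataclass

@dataclass
class MeetingSegment:
    start: float        # segment start time (seconds)
    end: float          # segment end time (seconds)
    speaker: str        # speaker identity (audio-visual identification)
    transcript: str     # recognized speech
    focus: dict         # estimated focus-of-attention target per participant

def find_segments(segments, speaker, addressee):
    """Segments in which `speaker` talks while focusing on `addressee`."""
    return [s for s in segments
            if s.speaker == speaker and s.focus.get(speaker) == addressee]

segments = [
    MeetingSegment(0.0, 12.5, "John", "let us review the schedule",
                   {"John": "Mary", "Mary": "John"}),
    MeetingSegment(12.5, 20.0, "Mary", "the deadline moved",
                   {"John": "Mary", "Mary": "whiteboard"}),
]

# e.g., "show me all parts of the meeting where John was telling Mary something"
print(find_segments(segments, "John", "Mary"))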
In order to quickly retrieve information from such a multimedia
meeting record, we can use various indexing methods.
It is well known that visual communication cues, such as
gesturing, looking at each other, or monitoring each other’s
facial expressions, play an important role during face-to-face
communication [3], [8]. Therefore, to fully understand an
ongoing conversation, it is necessary to capture and analyze
these visual cues in addition to spoken content. Once such
visual cues can be tracked, they can be used to index and
retrieve recorded meetings. Queries, such as “show me all parts
of the meeting, where John was telling Mary something about