474 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 13, NO. 3, JUNE 2011
A Multi-Gesture Interaction System Using a 3-D
Iris Disk Model for Gaze Estimation and an Active
Appearance Model for 3-D Hand Pointing
Michael J. Reale, Student Member, IEEE, Shaun Canavan, Student Member, IEEE, Lijun Yin, Senior Member, IEEE,
Kaoning Hu, and Terry Hung
Abstract—In this paper, we present a vision-based human–computer interaction system, which integrates control components using multiple gestures, including eye gaze, head pose, hand pointing, and mouth motions. To track head, eye, and mouth movements, we present a two-camera system that detects the face from a fixed, wide-angle camera, estimates a rough location for the eye region using an eye detector based on topographic features, and directs another active pan-tilt-zoom camera to focus on this eye region. We also propose a novel eye gaze estimation approach for point-of-regard (POR) tracking on a viewing screen. To allow for greater head pose freedom, we developed a new calibration approach to find the 3-D eyeball location, eyeball radius, and fovea position. Moreover, to obtain the optical axis, we create a 3-D iris disk by mapping both the iris center and iris contour points to the eyeball sphere. We then rotate the fovea accordingly and compute the final, visual-axis gaze direction. This part of the system permits natural, non-intrusive, pose-invariant POR estimation from a distance without resorting to infrared illumination or complex hardware setups. We also propose and integrate a two-camera hand pointing estimation algorithm for hand gesture tracking in 3-D from a distance. The algorithms of gaze pointing and hand finger pointing are evaluated individually, and the feasibility of the entire system is validated through two interactive information visualization applications.
Index Terms—Gaze estimation, hand tracking, human–computer interaction (HCI).
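The gaze computation summarized in the abstract (an optical axis from the calibrated eyeball center through the iris center mapped onto the eyeball sphere, the fovea rotated with the eye, and a visual axis derived from the rotated fovea) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the variable names, the +z reference optical axis, and the simplification that the visual axis runs from the rotated fovea through the eyeball center are our assumptions.

```python
import numpy as np

def rotation_aligning(a, b):
    """Rotation matrix taking unit vector a onto unit vector b (Rodrigues' formula)."""
    v = np.cross(a, b)
    c = np.dot(a, b)
    if np.isclose(c, 1.0):            # already aligned
        return np.eye(3)
    if np.isclose(c, -1.0):           # opposite: rotate 180 deg about any perpendicular axis
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-8:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx * (1.0 / (1.0 + c))

def visual_axis(eye_center, eye_radius, iris_center_3d, fovea_offset):
    """Estimate a visual-axis gaze ray (hypothetical sketch).

    eye_center     : calibrated 3-D eyeball center
    eye_radius     : calibrated eyeball radius
    iris_center_3d : 3-D iris center, already mapped onto the eyeball sphere
    fovea_offset   : calibrated fovea position relative to eye_center,
                     expressed for a reference optical axis of +z
    Returns (origin, direction) of the gaze ray.
    """
    # Current optical axis: from the eyeball center through the iris center.
    optical = (iris_center_3d - eye_center) / eye_radius
    # Rotate the fovea by the same rotation that took the reference axis to the optical axis.
    R = rotation_aligning(np.array([0.0, 0.0, 1.0]), optical)
    fovea = eye_center + R @ fovea_offset
    # Simplified visual axis: from the rotated fovea through the eyeball center.
    direction = eye_center - fovea
    direction /= np.linalg.norm(direction)
    return eye_center, direction
```

Intersecting the returned ray with the screen plane would then yield the point of regard; in practice the angular offset between optical and visual axes is what the fovea calibration accounts for.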
I. INTRODUCTION
THE ideal human–computer interaction (HCI) system should function robustly with as few constraints as those found in human-to-human interaction; moreover, it should map human gestures to application control in the most natural and intuitive way possible. Knowledge of eye gaze and point-of-regard offers insight into the user's intention and mental focus, and consequently this information is vital for the next generation of user interfaces [1], [2]. Applications in this vein can be found in the fields of HCI, security, advertising, psychology, education, and many others [1], [3]. While eye gaze gives us a more precise yet perhaps more transient idea of the user's focus, head pose gives a coarser but more committed approximation of the user's region of interest. As such, head pose has been leveraged directly for coarse gaze estimation [2], video game control [4] and navigation [5], and screen magnification [6]. Finally, hand pointing, one of the most common hand gestures, shows us the region that the user wishes another entity to focus on, irrespective of whether the user is actually looking at the point himself/herself. In an HCI context, this naturally maps to command control [7], and a few sample applications in this line include navigation in a 3-D world [8] and TV control [9].

Manuscript received November 01, 2010; accepted January 24, 2011. Date of publication February 28, 2011; date of current version May 18, 2011. This work was supported in part by the Air Force Research Laboratory (FA8750-08-1-0096), in part by the National Science Foundation (IIS-1051103), and in part by the New York State Science and Technology Office (C050040). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Zhengyou Zhang.

M. J. Reale, S. Canavan, L. Yin, and K. Hu are with the Department of Computer Science, State University of New York at Binghamton, Binghamton, NY 13902 USA (e-mail: lijun@cs.binghamton.edu).

T. Hung is with Corning Corp., Taichung City 40763, Taiwan.

This paper has supplementary downloadable material available at http://ieeexplore.ieee.org. Part 1 of the video demo is an evaluation of both the eye gaze tracking and hand pointing tracking. The total size is 11.08 MB. Part 2 of the video demo shows two applications of our multi-gesture interaction system in action. The total size is 19.97 MB.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2011.2120600
While many approaches use individual gestures for specific tasks, it is uncommon to use them simultaneously. In this paper, we propose a new algorithm for eye gaze estimation from a distance using a 3-D eyeball/iris disk model, and we also present a new hand pointing estimation approach in 3-D space. We use the combination of four different gesture-based inputs (eye gaze, head pose/position, hand pointing direction, and mouth opening/closing) to develop a multi-gesture interaction system. We demonstrate the feasibility and utility of our system through two application case studies: one is the 3-D Orb File Navigator, and the other is a geographic information visualization program (informally known as the "Midgard viewer"). The overall system has the potential to be extended into many different HCI applications in the educational, entertainment, business, and military sectors. We would also argue that our system allows gesture input at a finer level of detail than the current wave of commercial gesture control systems, such as the Microsoft Xbox Kinect [10], can provide. Specifically, we track eye gaze and mouth movement, and our hand tracking system incorporates a more precise model of the hand. Moreover, we use only regular, visible-spectrum cameras. In contrast, systems like the Kinect are robust and reliable but at present provide only very coarse gesture control (e.g., whole-body movement); they also require special hardware (i.e., depth cameras). It is our belief that a combination of coarse- and fine-grained control will be desirable in the next generation of gesture input devices.
This paper is organized as follows: in Section II, we will review the related work, and then we will provide an overview
1520-9210/$26.00 © 2011 IEEE