Copyright © 2012 by the Association for Computing Machinery, Inc.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for commercial advantage and that copies bear this notice and the full citation on the
first page. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on
servers, or to redistribute to lists, requires prior specific permission and/or a fee.
Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail
permissions@acm.org.
Web3D 2012, Los Angeles, CA, August 4 – 5, 2012.
© 2012 ACM 978-1-4503-1432-9/12/0008 $15.00
A flexible approach to gesture recognition and interaction in X3D

Tobias Franke* (Fraunhofer IGD, Germany)
Manuel Olbrich† (Fraunhofer IGD, Germany)
Dieter W. Fellner‡ (Fraunhofer IGD & TU Darmstadt, Germany; TU Graz, Austria)
Abstract
With the appearance of natural interaction devices such as the Microsoft Kinect or Asus Xtion PRO cameras, a whole new range of interaction modes has been opened up to developers. Tracking frameworks can make use of the additional depth image or skeleton tracking capabilities to recognize gestures. A popular example of one such implementation is the NITE framework from PrimeSense, which enables fine-grained gesture recognition. However, recognized gestures come with additional information such as velocity, angle or accuracy, which is not encapsulated in a standardized format and therefore cannot be integrated into X3D in a meaningful way. In this paper, we propose a flexible way to inject gesture-based metadata into X3D applications to enable fine-grained interaction. We also discuss how to recognize these gestures if the underlying framework provides no mechanism to do so.
CR Categories: I.3.1 [Input Devices]; I.3.6 [Interaction techniques]; I.3.6 [Standards]; I.3.6 [Device independence]
Keywords: X3D, Interaction, Kinect, OpenNI, Gesture
1 Introduction
Natural interaction (NI) devices such as the Microsoft Kinect depth-sensing camera or similar devices enable user and gesture recognition. To foster interoperability between different devices used for natural interaction, the open standard OpenNI [Ope 2010] defines a set of sensor-related production nodes, which give the developer unified access to the data gathered by the device, as well as middleware-related production nodes, which enable third-party developers to insert their own modules operating on the gathered data to, e.g., track scene elements, recognize gestures or produce other generic scene analysis. Other frameworks such as the Microsoft Kinect SDK [Mic 2012] do not come with their own implementation of gesture recognition and leave this task to the developer.
In any case, however, it is apparent that there is no unified structure with which gestures are encoded. Apart from a simple string representation describing the gesture that was recognized, meta-information such as speed, direction, force or other gesture-specific data simply varies too much. We propose a new field for the NI node described in [Franke et al. 2011] to encode such information in JSON-based data containers. We identify a default container with a minimal representation of a recognized gesture, which can carry a number of additional data fields that help X3D developers add more fine-grained controls to their applications.
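To illustrate the idea, the following sketch parses such a container and branches on its contents as an X3D script node might. The field names (gesture, userID, meta, direction, velocity, accuracy) are illustrative assumptions, not the paper's normative schema:

```python
import json

# Hypothetical minimal gesture container: a gesture identifier plus
# optional framework-specific meta fields. All names are illustrative.
event = json.loads("""
{
    "gesture": "Swipe",
    "userID": 1,
    "meta": {
        "direction": [1.0, 0.0, 0.0],
        "velocity": 0.8,
        "accuracy": 0.95
    }
}
""")

# A script receiving this string can refine its reaction using the
# attached meta data instead of acting on the bare identifier alone.
if event["gesture"] == "Swipe" and event["meta"]["direction"][0] > 0:
    action = "scroll-right"
else:
    action = "scroll-left"
```

The point of the container is exactly this second condition: without the meta block, "Swipe" alone could not distinguish the two actions.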
The contribution of this paper is that X3D application developers gain access to meta-information of tracked gestures from frameworks such as OpenNI. Whereas a simple string identifier such as "Swipe" is often too coarse to be used meaningfully, meta-information such as the swipe direction supports the developer in incorporating expressive gesture-based interaction.

* e-mail: tobias.franke@igd.fraunhofer.de
† e-mail: manuel.olbrich@igd.fraunhofer.de
‡ e-mail: d.fellner@gris.tu-darmstadt.de
2 Related Work
Franke et al. [Franke et al. 2011] proposed a new UserSensor node, introducing support for natural interaction devices compatible with the OpenNI [Ope 2010] standard or other frameworks. The node contains several fields to route data generated by the framework into X3D space, giving the application developer the ability to track user skeletons or mask out people from the background. Another field, gesture, reports any recognized gesture of a user as an SFString. Because most gesture recognition at this point in time is handled by the third-party framework NITE [Pri 2012], and OpenNI has no standardized way to transmit meta-information about gestures, this part was explicitly left open for future investigation. This paper aims to fill that gap by introducing a new concept to transmit gesture meta-information for arbitrary frameworks.
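A minimal sketch of how the gesture field of the UserSensor node from [Franke et al. 2011] could be routed to a script in an X3D scene; node names and attribute values beyond the gesture field itself are illustrative assumptions:

```xml
<!-- Illustrative routing sketch, not normative syntax from the paper. -->
<UserSensor DEF='niSensor'/>
<Script DEF='gestureHandler'>
  <field name='gesture' accessType='inputOnly' type='SFString'/>
</Script>
<ROUTE fromNode='niSensor' fromField='gesture'
       toNode='gestureHandler' toField='gesture'/>
```

With only an SFString arriving at the handler, any velocity or direction data the framework produced is lost, which motivates the container format proposed in this paper.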
Tracking user gestures has been studied extensively; a wide range of algorithms exists which try to identify user gestures from different data models. Depending on the input model, these algorithms can be categorized into appearance- and model-based techniques. Model-based algorithms process 3D models such as user skeletons or volumetric information, while appearance-based algorithms concern themselves with 2D representations of the performed gesture, such as an image sequence. Gesture tracking systems include controllers such as the Nintendo Wii controller or similar hardware, wired gloves which capture finger movement, or vision-based tracking algorithms implemented in the NITE SDK [Pri 2012] or the SoftKinetic SDK [Sof 2010]. The latter have received much attention since the release of the Microsoft Kinect depth-sensing camera.
Because gestures performed by humans vary in speed, location and accuracy, a probabilistic framework must account for time-varying gestures. The authors of [Webel et al. 2009] use a glove with pressure sensors and an optical marker to detect so-called skills. These skills are sequences of tasks performed in a specific order and are detected with the help of a Hidden Markov Model (HMM). While a skill or task is not strictly a simple gesture as such, the general tracking implementation is not limited to atomic operations: a complicated gesture can be split up into sequences of smaller, simpler sub-gestures and be transmitted into X3D with the same interface (a string and attached meta-information) identifying it.
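The decomposition idea above can be sketched as follows: a compound gesture is recognized whenever its sub-gestures occur as a contiguous run in the stream of recognized events. This is a simplified illustration, not the HMM machinery of [Webel et al. 2009]; the gesture names and event format are assumptions:

```python
# Hypothetical compound gesture defined as an ordered sequence of
# simpler sub-gestures, each reported via the same string interface.
WAVE = ["SwipeLeft", "SwipeRight", "SwipeLeft"]

def match_compound(events, pattern=WAVE):
    """Return True once the sub-gesture pattern occurs as a
    contiguous subsequence of the recognized gesture stream."""
    names = [e["gesture"] for e in events]
    for i in range(len(names) - len(pattern) + 1):
        if names[i:i + len(pattern)] == pattern:
            return True
    return False

# A stream of recognized events as they might arrive over time.
stream = [{"gesture": g} for g in
          ["Push", "SwipeLeft", "SwipeRight", "SwipeLeft", "Push"]]
```

Each sub-gesture keeps its own meta-information, so the compound event can aggregate or forward it unchanged through the same string-plus-metadata interface.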
[Webel et al. 2008] implement an HMM-based gesture recognition system, training HMMs on gestures performed 20 times by each of eight people. Each gesture is encoded in a codebook and HMM parameters, stored in a novel node called Gesture. Another node introduced in this work, called GestureRecognitionModel, is fed with Gesture nodes that