Copyright © 2012 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail permissions@acm.org. Web3D 2012, Los Angeles, CA, August 4 – 5, 2012. © 2012 ACM 978-1-4503-1432-9/12/0008 $15.00

A flexible approach to gesture recognition and interaction in X3D

Tobias Franke* (Fraunhofer IGD, Germany), Manuel Olbrich (Fraunhofer IGD, Germany), Dieter W. Fellner (Fraunhofer IGD & TU Darmstadt, Germany; TU Graz, Austria)

Abstract

With the appearance of natural interaction devices such as the Microsoft Kinect or Asus Xtion PRO cameras, a whole new range of interaction modes has been opened up to developers. Tracking frameworks can make use of the additional depth image or skeleton tracking capabilities to recognize gestures. A popular example of one such implementation is the NITE framework from PrimeSense, which enables fine-grained gesture recognition. However, recognized gestures come with additional information such as velocity, angle or accuracy, which is not encapsulated in a standardized format and therefore cannot be integrated into X3D in a meaningful way. In this paper, we propose a flexible way to inject gesture-based meta-data into X3D applications to enable fine-grained interaction. We also discuss how to recognize these gestures if the underlying framework provides no mechanism to do so.
CR Categories: I.3.1 [Input Devices] I.3.6 [Interaction techniques] I.3.6 [Standards] I.3.6 [Device independence]

Keywords: X3D, Interaction, Kinect, OpenNI, Gesture

1 Introduction

Natural interaction (NI) devices such as the Microsoft Kinect depth-sensing camera enable user and gesture recognition. To foster interoperability between the different devices used for natural interaction, the open standard OpenNI [Ope 2010] defines a set of sensor-related production nodes which give the developer unified access to the data gathered by the device, as well as middleware-related production nodes, which enable third-party developers to insert their own modules operating on the gathered data to, e.g., track scene elements, recognize gestures or produce other generic scene analysis. Other frameworks such as the Microsoft Kinect SDK [Mic 2012] do not come with their own implementation of gesture recognition and leave this task to the developer.

In any case, however, it is apparent that there is no unified structure with which gestures are encoded. Apart from a simple string representation describing the gesture that was recognized, meta-information such as speed, direction, force or other gesture-specific data simply varies too much. We propose a new field for the NI node described in [Franke et al. 2011] to encode such information as JSON-based data containers. We identify a default container with a minimal representation of a recognized gesture, which can carry a number of additional data fields that help X3D developers add more fine-grained controls to their applications.

The benefit of this paper is that X3D application developers gain access to meta-information of tracked gestures from frameworks such as OpenNI.
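To illustrate the idea of a JSON-based gesture container, the following Python sketch serializes a recognized gesture together with its meta-information and parses it back. The field names used here (name, speed, direction, confidence) are illustrative assumptions for this sketch, not the default container schema proposed in the paper.

```python
import json

def encode_gesture(name, **meta):
    """Serialize a recognized gesture plus arbitrary meta-information
    into a JSON container, as it might be routed into X3D.
    The gesture name mirrors the string identifier a framework
    such as NITE would report; all other fields are hypothetical."""
    container = {"name": name}
    container.update(meta)
    return json.dumps(container)

def decode_gesture(payload):
    """Parse the JSON container back into a dictionary on the
    X3D application side."""
    return json.loads(payload)

# A "Swipe" gesture enriched with assumed meta-data fields.
payload = encode_gesture(
    "Swipe",
    speed=0.8,                    # normalized hand velocity (assumption)
    direction=[1.0, 0.0, 0.0],    # swipe direction vector (assumption)
    confidence=0.95)              # recognizer confidence (assumption)

gesture = decode_gesture(payload)
print(gesture["name"], gesture["direction"])  # prints: Swipe [1.0, 0.0, 0.0]
```

Because JSON is self-describing, a container like this can carry framework-specific extras without changing the X3D field type: an application that only needs the gesture name can ignore the rest, while one that needs the swipe direction can read it from the same payload.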
Whereas a simple string identifier such as "Swipe" is often too coarse to be used meaningfully, meta-information such as the swipe direction supports the developer in incorporating expressive gesture-based interaction.

* e-mail: tobias.franke@igd.fraunhofer.de, manuel.olbrich@igd.fraunhofer.de, d.fellner@gris.tu-darmstadt.de

2 Related Work

Franke et al. [Franke et al. 2011] proposed a new UserSensor node, introducing support for natural interaction devices compatible with the OpenNI [Ope 2010] standard or other frameworks. The node contains several fields to route data generated by the framework into X3D space, giving the application developer the ability to track user skeletons or mask out people from the background. Another field, gesture, reports any recognized gesture of a user as SFString type. Because most gesture recognition at this point in time is handled by the third-party framework NITE [Pri 2012], and OpenNI has no standardized way to transmit meta-information about gestures, this part was explicitly left open for future investigation. This paper aims to fill that gap by introducing a new concept to transmit gesture meta-information for arbitrary frameworks.

Tracking user gestures has been studied extensively; a wide range of algorithms exist which try to identify user gestures from different data models. Depending on the input model, these algorithms can be categorized into appearance- and model-based techniques. Model-based algorithms process 3D models such as user skeletons or volumetric information, while appearance-based algorithms concern themselves with 2D representations of the performed gesture such as an image sequence. Gesture tracking systems include controllers such as the Nintendo Wii controller or similar hardware, wired gloves which capture finger movement, or vision-based tracking algorithms implemented in the NITE SDK [Pri 2012] or the SoftKinetic SDK [Sof 2010].
The latter have received much attention since the release of the Microsoft Kinect depth-sensing camera. Because gestures performed by humans vary in speed, location and accuracy, a probabilistic framework must account for time-varying gestures. The authors of [Webel et al. 2009] use a glove with pressure sensors and an optical marker to detect so-called skills. These skills are sequences of tasks performed in a specific order and are detected with the help of a Hidden Markov Model (HMM). While a skill or task is not strictly a simple gesture as such, the general tracking implementation is of course not limited to these operations being atomic: a complicated gesture can be split up into sequences of smaller, simpler sub-gestures and be transmitted into X3D with the same interface (a string and attached meta-information) identifying it.

[Webel et al. 2008] implement an HMM-based gesture recognition system. HMMs were trained with gestures performed 20 times by each of eight people. Each gesture is encoded in a codebook and HMM parameters, stored in a novel node called Gesture. Another node introduced in this work, called GestureRecognitionModel, is fed with Gesture nodes that