Gesture input and annotation for interactive systems*

Paolo Barattini and Andrea Corradini

*The research presented in this paper as part of the LOCOBOT Project has been financed by the European Commission grant N° FP7 NMP 260101. Paolo Barattini (corresponding author) is with Ridgeback sas, Turin, Italy, phone: +39-0172-575087, e-mail: paolo.barattini@yahoo.it. Andrea Corradini is with the IT College of Media and Design, Copenhagen, Denmark, e-mail: andc@kea.dk.

Abstract - This presentation addresses the need for a shared definition and repository of gestures (referred to as a gestabulary) and for an annotation system within the robotics community. This need arises from the necessity of creating common ground on which to build effective Human Robot Interaction (HRI) systems. Over the last couple of decades, significant efforts have been made towards the development of user interfaces for human-robot interaction based on a combination of natural input modes such as vision, audio, pen, and gesture. These body-centered intelligent interfaces not only substitute for common interface devices but can also be exploited to extend their functionality. While earlier systems and prototypes considered the input modes individually, it quickly became apparent that the different modalities should be considered in combination. The rationale behind this finding is the evidence that each single modality can be used to leverage and complement the semantic information delivered on every other input channel. One of the most promising interaction modes is the use of natural gestures. For gestural interaction between a mobile system such as a robot and human users, visual information is particularly relevant because it gives the system the capability to observe its operational environment in an active manner. Although relatively successful, the use of gesture has nevertheless been confined to a few scenarios and application contexts. This is due to the lack of a technical definition of what a gesture is, which in turn results in the lack of a classification of the different kinds of human gestures.

I. ON GESTURES

The keyboard was the main input device for many years. Thereafter, the widespread introduction of the mouse in the early 1980's changed the way people interacted with computers. Lately, a large number of input devices, such as those based on pen, haptics, or finger movements, have appeared. The main impetus driving the development of new input technologies has been the demand for more natural interaction systems. Several promising user interfaces that integrate various natural interaction modes [7,12,15] and/or use tangible objects [1,2,24] have been put forward. Speech, the primary human communication mode, has been successfully integrated in several commercial and prototype systems. From voice commands [10], speech interfaces have evolved into conversational interfaces [6], which rely on the metaphor of a conversation modeled after human-human conversations. Several gesture systems have also been proposed to date, yet we are not aware of any of them capable of reaching near-human recognition performance.

In the areas of computer science and engineering, gesture recognition has been approached within a general pattern recognition framework and therefore with the same tools and techniques adopted in other research areas like speech and handwriting recognition. While speech is fundamentally a sound wave, i.e. a temporal sequence of alternating high and low pressure pulses in the medium through which the wave travels, and while handwriting can be seen as a temporal sequence of ink on a 2D surface, gestures are interpreted as a set of connected spatial movements. From this perspective, a gesture is a trajectory in 3D space, and as such it resembles handwriting in a higher-dimensional space.
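To make this view concrete, the following minimal sketch treats a gesture as a sampled 3D trajectory and compares it against labeled templates with dynamic time warping, one of the standard pattern-recognition tools shared with speech and handwriting recognition. It is only an illustrative sketch under the assumption that hand positions are already being tracked; the function and variable names are ours and do not describe any particular system.

    import numpy as np

    def dtw_distance(a, b):
        # a, b: gesture trajectories of shape (T, 3), i.e. (x, y, z) samples over time.
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])   # distance between two 3D samples
                cost[i, j] = d + min(cost[i - 1, j],      # skip a sample of a
                                     cost[i, j - 1],      # skip a sample of b
                                     cost[i - 1, j - 1])  # align the two samples
        return cost[n, m]

    def classify(observed, templates):
        # templates: dict mapping a gesture label to a reference trajectory.
        return min(templates, key=lambda label: dtw_distance(observed, templates[label]))

    # Usage with synthetic trajectories of different lengths.
    templates = {"wave": np.cumsum(np.random.randn(30, 3), axis=0),
                 "point": np.cumsum(np.random.randn(25, 3), axis=0)}
    observed = np.cumsum(np.random.randn(28, 3), axis=0)
    print(classify(observed, templates))

The time warping absorbs differences in execution speed, but it does not by itself resolve the variability across users and across repetitions discussed next.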
The difficulty in dealing with gesture is thus mainly due to its spatio-temporal variation. Similarly to speech and handwriting, intrinsic intra- and inter-personal differences can be found in the production of gestures. The same gesture usually varies when performed by different people. Moreover, even the same person is never able to reproduce a gesture exactly. Gesture, however, poses an additional problem of a technical nature. Gesture recognition is influenced by the devices used to capture the movement that underlies the gesture as well as by the environmental conditions in which the gesture is performed.

Hand, limb and arm tracking is the principal requirement of gesture-centered applications. Users of such applications were usually required to wear a suit or glove equipped with sensors that measure their 3D position and orientation. These input devices pick up very accurate input data, but are uncomfortable and cumbersome for the user to wear. Furthermore, they are of little use in real-world contexts in which human users happen to encounter a service or assistant robot. The same holds for many work environments in which the user performs multiple tasks and operations and cannot be encumbered by additional devices like gloves or overalls with markers, which usually even restrict the user's natural movements.

Camera-based input devices are much more user-friendly as they are less intrusive. Given their modest hardware requirements, they also represent a cheap and feasible alternative to wearable sensors. Nonetheless, they introduce problems of their own, rooted both in the computational cost of real-time image processing and in the difficulty of extracting 3D information from 2D images. Sensing gestures with a camera is still a fragile task that usually works only in a constrained environment (see the sketch at the end of this section).

The idea of using gestures and/or speech to interact with a robot has begun to emerge only recently, as most efforts in the field of robotics have hitherto concentrated on navigation issues. Several generations of service robots operating in commercial or service surroundings, both to give support to and to interact with people, have been deployed to date [3,16], and this field of research is gaining more and more attention from industry and academia.
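As a concrete illustration of the camera-based pipeline discussed above, the sketch below isolates a hand-like region in a single frame by skin-colour thresholding, the kind of constrained-environment heuristic that makes such systems fragile. It assumes OpenCV (4.x) and a webcam are available; the thresholds, names and overall approach are our illustrative choices rather than those of any particular system, and 3D reconstruction and temporal tracking are deliberately left out.

    import cv2
    import numpy as np

    cap = cv2.VideoCapture(0)                      # default webcam
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError("no frame captured")

    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Rough skin-colour range in HSV; only plausible under controlled lighting.
    mask = cv2.inRange(hsv, np.array([0, 40, 60]), np.array([25, 180, 255]))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

    # OpenCV 4.x returns (contours, hierarchy).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        hand = max(contours, key=cv2.contourArea)  # assume the largest skin blob is the hand
        x, y, w, h = cv2.boundingRect(hand)
        cx, cy = x + w // 2, y + h // 2            # 2D hand position for one time step
        print("hand centre (pixels):", cx, cy)

Repeating this per frame yields only a 2D trajectory; recovering the 3D trajectories assumed in the earlier sketch requires additional cues such as stereo or depth sensing, which is precisely where the computational and robustness problems mentioned above arise.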