Hand Parsing from Single Depth Image using Random Decision Forest
Kartik Erappa Cholachgudda, Communication Engineering Dept., VIT University, Vellore, India, kartikenc@gmail.com
Puram Dixith Reddy, Communication Engineering Dept., VIT University, Vellore, India, dixithreddy3216@gmail.com
Lokanath M., Communication Engineering Dept., VIT University, Vellore, India, lokanath.m@vit.ac.in
Abstract—We describe a hand parsing algorithm that can be used for hand pose estimation, hand tracking and gesture recognition, all of which are useful for human-computer interaction. We use a depth camera (Intel® Creative Interactive Gesture Camera) to acquire the hand images, which offers several advantages over a conventional RGB optical camera. We employ an intermediate hand parsing scheme designed to produce an accurate per-pixel classification of the hand parts, which can then be used to localize the joints of the hand. An efficient random decision forest classifies the hand parts, which in turn helps to estimate the hand pose. Simulation results were observed while varying several training parameters of the decision forest. Overall, we obtained an efficient method that forms a basis for the development of hand pose estimation and tracking, and we gained in-depth knowledge of decision forests.
Keywords— Decision Tree, RDF, Per-pixel Classification, Weak learner function, Entropy.
I. INTRODUCTION
In the last few decades, accurate estimation of hand pose has received a lot of attention in many fields such as animation, gaming [19, 2], human-computer interaction, robotics [21], hand tracking systems [3, 20], gesture recognition [17, 18], sign language recognition [22], security systems and many other commercial applications [1, 5, 13]. Despite extensive research effort in the field, hand pose estimation remains a challenging problem, mainly because of the complex nature of hand articulations.
The parsed hand parts are very useful high-level features for hand pose estimation and gesture recognition. Hence, the first step in either hand tracking or hand pose estimation is an effective classification of the hand into parts, also called per-pixel classification. In this paper we classify the hand into six different parts, using classification forests [16] to train on and test the hand images.
A major problem in hand pose estimation is the use of optical cameras. Many previous studies use an optical camera as the input [17], which makes it difficult to discriminate the hand parts against a cluttered background. Because the hand is quite homogeneous in colour, the processing steps also become more complex. To overcome this problem we use a high-speed depth sensor, which greatly simplifies the task of hand parsing. Compared with a colour camera, it works in low light conditions, helps remove ambiguity in scale, colour and texture, and resolves silhouette ambiguities. The depth camera also helps in the pre-processing steps of our algorithm by simplifying background elimination.
Hand parsing can be seen as a classification problem, so we use the efficient and highly successful Random Decision Forest (RDF) [7, 12] to solve it. This paper builds on the classification forest that has been a core framework in the development of the commercially successful Microsoft Kinect [2] gaming system for real-time tracking of the human body. We employ an identical classification forest to estimate the hand pose by parsing. The aim of our research is to design a system that is robust and computationally efficient.
Towards this aim, this paper presents an algorithm for estimating hand pose by parsing, using an RDF that classifies the hand into 6 different regions, C = {thumb, little finger, middle finger, palm, fore finger, ring finger} [11].
Figure 1. The labelled distribution for different hand parts
Given a depth image of a hand obtained from an Intel® depth camera, we wish to say which hand part each pixel belongs to. This is a typical job for a classification forest. Our algorithm considers the 6 hand part classes mentioned earlier. The unit of computation is a single image pixel and its depth feature. We first construct a tree structure from the training images, together with the split node (weak learner) function and its parameters, which are explained in the following sections. The RDF uses simple depth comparison features, which give 3D translation invariance while maintaining high computational efficiency
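
To make the depth comparison feature concrete, the following is a minimal sketch and not the authors' implementation. It assumes the feature has the form used in the Kinect body-part forest [2], f(I, x) = d(x + u/d(x)) - d(x + v/d(x)), where d(x) is the depth at pixel x and u, v are two offsets; the text above does not state the formula, so this form is an assumption. The names `probe`, `depth_feature`, `goes_left` and the constant `BACKGROUND` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Depth = List[List[float]]      # depth image in metres, indexed [row][col]
Offset = Tuple[float, float]   # (row, col) offset, in pixels times metres

# Assumption: probes that fall off the image or on invalid (<= 0) depth
# are treated as far background and given a large constant depth.
BACKGROUND = 10.0


def probe(depth: Depth, row: int, col: int) -> float:
    """Return the depth at (row, col), or BACKGROUND if it is unavailable."""
    in_image = 0 <= row < len(depth) and 0 <= col < len(depth[0])
    if in_image and depth[row][col] > 0:
        return depth[row][col]
    return BACKGROUND


def depth_feature(depth: Depth, pixel: Tuple[int, int],
                  u: Offset, v: Offset) -> float:
    """Difference of two depth probes around `pixel`.

    Both offsets are divided by the depth at `pixel`, so a hand that is
    farther away (and therefore smaller in the image) is probed at
    proportionally smaller pixel offsets.
    """
    r, c = pixel
    d = probe(depth, r, c)
    first = probe(depth, r + int(round(u[0] / d)), c + int(round(u[1] / d)))
    second = probe(depth, r + int(round(v[0] / d)), c + int(round(v[1] / d)))
    return first - second


def goes_left(depth: Depth, pixel: Tuple[int, int],
              u: Offset, v: Offset, tau: float) -> bool:
    """Weak learner at a split node: threshold the feature at tau."""
    return depth_feature(depth, pixel, u, v) < tau


# A toy 5x5 depth image: a 3x3 hand patch at 0.5 m, zeros mark no reading.
toy = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
edge_response = depth_feature(toy, (2, 2), (0.0, 1.0), (0.0, 0.0))
flat_response = depth_feature(toy, (2, 2), (0.0, 0.5), (0.0, 0.0))
```

In this toy image the first probe pair reaches past the hand into the background and gives a large response, while the second pair stays on the hand and gives zero. Each split node of a tree would store its own (u, v, tau), and a pixel is routed left or right by `goes_left` until it reaches a leaf holding a distribution over the 6 hand part classes.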