A Robust Visual Human Detection Approach with UKF Based Motion Tracking for a Mobile Robot

Meenakshi Gupta, Laxmidhar Behera, Senior Member, IEEE, K. S. Venkatesh, and Mo Jamshidi, Fellow, IEEE

Abstract—Robust tracking of a human in a video sequence is an essential prerequisite to a growing number of applications in which a robot needs to interact with a human user or operate in a human-inhabited environment. This paper presents a robust approach that enables a mobile robot to detect and track a human using an on-board RGB-D sensor. Such robots could be used in security, surveillance, and assistive robotics applications. Our approach achieves real-time performance through a unique combination of new ideas and well-established techniques. In the proposed method, background subtraction is combined with a depth-segmentation detector and a template-matching method to initialize human tracking automatically. We introduce the novel concept of Head and Hand creation based on depth of interest to track the human silhouette in a dynamic environment, when the robot is moving. To make the algorithm robust, we utilize a series of detectors (e.g. height, size, shape) to distinguish the target human from other objects. Because of the relatively high computation time of the silhouette-matching-based method, we define a confidence level that allows us to use the matching-based method only where it is imperative. An unscented Kalman filter (UKF) is used to predict the human location in the image frame so as to maintain the continuity of the robot motion. The efficacy of the approach is demonstrated through a real experiment on a mobile robot navigating in an indoor environment.

Index Terms—Human silhouette, Projection histogram, Head and Hand creation, Distance transform, Unscented Kalman filter.

I. INTRODUCTION

Introducing visual tracking capabilities in artificial visual systems is one of the most active research challenges in mobile robotics.
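As background for the UKF prediction step mentioned in the abstract, the following is a minimal sketch of an unscented predict step applied to a constant-velocity image-plane motion model. The Merwe scaled sigma-point parameters (alpha, beta, kappa) and the constant-velocity model are illustrative assumptions for this sketch, not the paper's exact design.

```python
import numpy as np

def sigma_points(x, P, alpha=1e-3, beta=2.0, kappa=0.0):
    """Merwe scaled sigma points and weights for mean x, covariance P."""
    n = x.size
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * P)       # matrix square root
    pts = np.vstack([x, x + S.T, x - S.T])      # 2n+1 sigma points (rows)
    Wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (n + lam)
    Wc[0] = lam / (n + lam) + (1 - alpha**2 + beta)
    return pts, Wm, Wc

def ukf_predict(x, P, f, Q):
    """Unscented prediction: propagate sigma points through motion model f."""
    pts, Wm, Wc = sigma_points(x, P)
    Y = np.array([f(p) for p in pts])
    x_pred = Wm @ Y
    d = Y - x_pred
    P_pred = d.T @ (Wc[:, None] * d) + Q
    return x_pred, P_pred

# Constant-velocity model for the target's image-plane state (u, v, du, dv).
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
f = lambda s: F @ s

x = np.array([120.0, 80.0, 2.0, -1.0])   # pixel position and velocity
P = np.eye(4) * 5.0                      # state covariance
Q = np.eye(4) * 0.1                      # process noise
x_pred, P_pred = ukf_predict(x, P, f, Q)
```

For this linear motion model the unscented prediction reproduces the standard Kalman predict (x_pred = F x, P_pred = F P Fᵀ + Q); the sigma-point machinery pays off only when f is nonlinear, as with a camera projection model.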
Visual tracking of a non-rigid object, such as a human, is an interesting research field in mobile robotics and has received much attention in recent years because of potential applications such as site security [1], rehabilitation in hospitals [2], [3], guidance in museums [4], assistance in offices [5], and military applications. In such applications, a mobile robot not only needs to detect the human, but also needs to track it continuously in a dynamic environment, where the usual background subtraction cannot be used. It is also necessary to give motion commands to the robot at regular intervals in order to maintain continuous and smooth robot motion, even when the image processing may take longer than the permitted interval. In such a case, it would be necessary to predict the human location in the image plane based on an approximate human motion model [6].

Manuscript received ; revised . Current version published. Meenakshi Gupta, Laxmidhar Behera, and K. S. Venkatesh are with the Department of Electrical Engineering, Indian Institute of Technology, Kanpur, 208016, India (e-mail: {meenug, lbehera, venkats}@iitk.ac.in). Mo Jamshidi is with the Department of Electrical and Computer Engineering and ACE Center, University of Texas, San Antonio, TX 78249 USA (e-mail: moj@wacong.org).

The primary sensor used for human tracking in robotic applications is a vision sensor such as a camera [7], [8]. Vision is an attractive choice as it facilitates passive sensing of the environment and provides valuable information about the scene that is unavailable through other sensors. Owing to this, many algorithms have been developed that detect a human in color images by extracting features such as the face [9], skin color [10], or cloth color [11], and these have been implemented on mobile robotic platforms. Although the algorithms developed using a single feature (e.g.
face, skin, or cloth color) are computationally efficient, they fail to detect the human robustly in a dynamic environment. For example, the algorithm of [12], which uses face detection to track a human, fails when implemented on a mobile robotic platform: in practical scenarios, when a robot starts tracking a human, the face is often not visible to the robot. Therefore, researchers have started to combine multiple visual features to make human detection robust. Darrell et al. [13] combine multiple visual modalities for real-time person tracking: depth information is extracted using a dense real-time stereo technique and used to segment the user from the background, and skin-color and face-detection algorithms are then applied to the segmented regions. Their algorithm assumes that the user is nearest to the stereo rig and that the human face is visible to the robot. Gavrila [14] presented a multi-cue vision system for the real-time detection and tracking of pedestrians from a moving vehicle. The algorithm integrates consecutive modules for stereo-based ROI generation, shape-based detection, texture-based classification, and stereo-based verification; its computation time is high because stereo is used for disparity-map generation. In the literature, the human detection algorithm developed by Dalal et al. [15] is found to be the most robust. They use Histograms of Oriented Gradients (HOG) descriptors and an SVM classifier to detect the human. Although the algorithm is robust, its high computation time limits its application in real-time systems. In [16], Liyuan et al. integrate multiple vision models for robust human detection and tracking, combining HOG-based and stereo-based human detection through mean-shift tracking. Combining multiple vision models makes human detection robust but simultaneously increases the computation cost of the system.
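To make the HOG-based detection of [15] concrete, the sketch below computes per-cell gradient-orientation histograms for the canonical 64x128 person window. It is a simplified illustration only: block normalization and the SVM classification stage are omitted, and while the 8x8-pixel cells and 9 unsigned-orientation bins match the commonly cited HOG configuration, the gradient kernel and windowing here are our own minimal choices.

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Per-cell histograms of oriented gradients (simplified: no block
    normalization, unsigned orientation over 0-180 degrees)."""
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # centered differences
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    H, W = img.shape
    cy, cx = H // cell, W // cell
    hist = np.zeros((cy, cx, bins))
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for i in range(cy):
        for j in range(cx):
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            # magnitude-weighted vote into orientation bins
            hist[i, j] = np.bincount(b, weights=m, minlength=bins)
    return hist

# A 64x128 detection window, as used for pedestrian detection in [15].
rng = np.random.default_rng(0)
window = rng.random((128, 64))
h = hog_cells(window)
print(h.shape)   # (16, 8, 9): 16x8 cells, 9 orientation bins each
```

In the full pipeline, these cell histograms are normalized over overlapping blocks, concatenated into one descriptor vector, and scored by a linear SVM; the cost of evaluating this densely over scales and positions is what makes the method expensive for real-time robotic use.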
To meet the real-time requirements of human tracking, most existing systems either employ a laser sensor alone or combine laser-sensor information with color-camera information. Woojin et al. [17] proposed the detection