Real Time Body Pose Tracking in an Immersive Training Environment

Chi-Wei Chu and Ramakant Nevatia
Institute for Robotics and Intelligent Systems
University of Southern California
Los Angeles, CA 90089-0273
{chuc,nevatia}@usc.edu

Abstract. We describe a visual communication application for a dark, theater-like interactive virtual simulation training environment. Our system visually estimates and tracks the body position, orientation, and arm-pointing direction of the trainee. The system uses a near-IR camera array to capture images of the trainee from different angles in the dimly lit theater. Image features such as silhouettes and intermediate silhouette body-axis points are then segmented and extracted from the image backgrounds. 3D body shape information, such as 3D body skeleton points and visual hulls, can be reconstructed from these 2D features in multiple calibrated images. We propose a particle-filtering-based method that fits an articulated body model to the observed image features. Currently we focus on the arm-pointing gesture of either limb. From the fitted articulated model we can derive the position on the screen that the user is pointing to. We use current graphics hardware to accelerate processing so that the system works in real time. The system serves as part of a multi-modal user-input device in the interactive simulation.

1 Introduction

Training humans for demanding tasks in a simulated environment is of increasing importance, not only to save costs but also to reduce training risk for hazardous tasks. A key issue then becomes the modalities by which the human trainee communicates with the characters in the synthetic environment. Speech is one natural modality, but visual communication, such as gestures and facial expressions, is also important for a seamless human-computer interaction (HCI) interface. Our objective is to achieve such communication, coupled in the longer term with other modalities such as speech.
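The pose-fitting step mentioned in the abstract is based on particle filtering. As a hedged illustration of the general technique only (not the paper's articulated-model likelihood, which operates on image silhouettes), the sketch below runs a minimal one-dimensional particle filter: predict with motion noise, weight by a Gaussian observation likelihood, and resample.

```python
import math
import random

def particle_filter_step(particles, observation, motion_std, obs_std):
    """One predict-weight-resample cycle for a 1D state (illustrative only)."""
    # Predict: diffuse each particle with Gaussian motion noise.
    moved = [p + random.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the observation given each particle.
    weights = [math.exp(-0.5 * ((observation - p) / obs_std) ** 2) for p in moved]
    total = sum(weights)
    if total == 0.0:
        # Degenerate case (all likelihoods underflow): fall back to uniform.
        weights = [1.0] * len(moved)
        total = float(len(moved))
    weights = [w / total for w in weights]
    # Resample: draw a new particle set proportional to the weights.
    return random.choices(moved, weights=weights, k=len(moved))

random.seed(0)
true_state = 1.5                                   # hypothetical hidden state
particles = [random.uniform(-5.0, 5.0) for _ in range(500)]
for _ in range(20):
    observation = true_state + random.gauss(0.0, 0.1)
    particles = particle_filter_step(particles, observation, 0.05, 0.1)
estimate = sum(particles) / len(particles)         # posterior mean estimate
```

In the paper's setting the state would be the articulated body pose rather than a scalar, and the likelihood would score how well the projected body model matches the extracted silhouette features.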
Here, we describe the first steps of body position and pose tracking that are essential to both forms of communication.

A synthetic training environment must be immersive to be effective. Instead of wearing a head-mounted display, the user stands on a stage facing a large screen that displays a 3D-rendered virtual environment, such as a war-zone city street or a field hospital. This environment makes visual sensing very challenging: the environment is very dark, and the illumination fluctuates rapidly as the scenes on the screen change. The sensing system must be passive and must not interfere with communication or the displays. The trainee can walk around in a limited area for natural responses and does not

M. Lew et al. (Eds.): HCI 2007, LNCS 4796, pp. 146–156, 2007.
© Springer-Verlag Berlin Heidelberg 2007
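The abstract's final step, deriving the on-screen position the user points to, can be sketched as a ray-plane intersection: cast a ray from the shoulder through the hand of the fitted body model and intersect it with the screen plane. This is a minimal sketch assuming the screen lies in the plane z = 0 in the world frame; the function name and the joint coordinates below are hypothetical, not from the paper.

```python
import math

def screen_intersection(shoulder, hand, screen_z=0.0):
    """Intersect the ray from shoulder through hand with the plane z = screen_z.

    shoulder, hand: (x, y, z) joint positions from a fitted body model.
    Returns the (x, y) screen point, or None if the ray misses the screen.
    """
    dx = hand[0] - shoulder[0]
    dy = hand[1] - shoulder[1]
    dz = hand[2] - shoulder[2]
    if abs(dz) < 1e-9:
        return None  # arm is parallel to the screen plane
    t = (screen_z - shoulder[2]) / dz
    if t < 0.0:
        return None  # pointing away from the screen
    return (shoulder[0] + t * dx, shoulder[1] + t * dy)

# Hypothetical coordinates in meters: user stands 3 m from the screen,
# shoulder at 1.5 m height, hand slightly forward, right, and down.
point = screen_intersection((0.0, 1.5, 3.0), (0.3, 1.4, 2.5))
```

With these example joints the pointing ray meets the screen at roughly x = 1.8 m, y = 0.9 m; in practice the 3D joint positions would come from the particle-filter-fitted articulated model, and the screen plane from the camera calibration.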