Fusion of Laser and Vision for Multiple Targets Tracking via On-line Learning

Xuan Song†, Huijing Zhao†, Jinshi Cui†, Xiaowei Shao‡, Ryosuke Shibasaki‡ and Hongbin Zha†

Abstract— Multi-target tracking becomes significantly more challenging when the targets are in close proximity or frequently interact with each other. This paper presents a promising tracking system to deal with these problems. The novelty of this system is that laser and vision, tracking and learning, are integrated and can complement each other in one framework: when the targets do not interact with each other, laser-based independent trackers are employed, and visual information is simultaneously extracted to train classifiers for the "possible interacting targets". When the targets are in close proximity, the learned classifiers and visual information are in turn used to assist in tracking. This mode of co-operation therefore not only deals with various difficult problems encountered in tracking, but also ensures that the entire process is completely on-line and automatic. Experimental results demonstrate that laser and vision fully display their respective advantages in our system, and that a good trade-off between tracking accuracy and time-cost is easily obtained.

I. INTRODUCTION

A robust and efficient multi-target tracking system has become an urgent need in various application domains, such as surveillance, pedestrian flow analysis, intelligent transportation and many others. Compared with traditional vision-based tracking systems, the laser range scanner, a relatively new kind of measurement instrument, has received increasing attention for solving tracking problems in recent years. In a laser-based tracking system (as shown in Fig. 1), the targets are represented by several points; hence tracking becomes much easier, and much better performance in both accuracy and time-cost can be obtained when the targets are far apart.
The system of [1], [2] has been successfully applied in a JR subway station in Tokyo for pedestrian flow analysis and reached 83% accuracy overall.

†Xuan Song, Huijing Zhao, Jinshi Cui and Hongbin Zha are with the Key Laboratory of Machine Perception (MoE), Peking University, China. E-mail: {songxuan,zhaohj,cjs,zha}@cis.pku.edu.cn.
‡Xiaowei Shao and Ryosuke Shibasaki are with the Center for Spatial Information Science, University of Tokyo, Japan. E-mail: shaoxw@iis.u-tokyo.ac.jp, shiba@csis.u-tokyo.ac.jp.

Fig. 1. A typical laser-based tracking system.

Fig. 2. How can the persons' correct trajectories be maintained under this condition? In frame 105, three persons were walking together; they merged in frame 175. How can their correct trajectories be maintained when they split?

However, the drawback of a laser-based tracking system is inherent and obvious: it lacks visual information, so it is difficult to obtain a set of features that uniquely distinguishes one object from another. Hence, when the targets are in close proximity or frequently interact with each other, robust tracking becomes especially challenging. Moreover, when the well-known "merge/split" condition occurs (as shown in Fig. 2), maintaining the correct trajectories seems almost impossible. A natural idea is to fuse laser and vision into one framework to solve these problems. The core concerns of this research are therefore: (1) How can laser and vision fully display their respective advantages in one framework to solve the difficult problems encountered in multi-target tracking? (2) How can a tracking system be developed that achieves a good trade-off between tracking accuracy and time-cost?

In this paper, we integrate laser and vision, tracking and learning, and make them complement each other in one framework to deal with various tracking problems. The key idea of this work is depicted in Fig. 3 and Fig. 4.
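To make the co-operation between laser-based tracking and on-line learning concrete, the per-frame mode switch described above can be sketched as follows. This is a minimal illustration, not the actual system: the proximity threshold and the helper functions `laser_track`, `visual_track`, `extract_sample` and `update_classifier` are hypothetical placeholders standing in for the laser tracker, the vision-assisted tracker, feature extraction from camera data and classifier training, respectively.

```python
import math

INTERACTION_RADIUS = 1.0  # assumed proximity threshold (metres); hypothetical value


def targets_interacting(positions, radius=INTERACTION_RADIUS):
    """Return the ids of targets whose pairwise distance falls below radius."""
    ids = list(positions)
    close = set()
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            (xa, ya), (xb, yb) = positions[a], positions[b]
            if math.hypot(xa - xb, ya - yb) < radius:
                close.update((a, b))
    return close


def track_frame(positions, classifiers, laser_track, visual_track,
                extract_sample, update_classifier):
    """One step of the laser/vision co-operation loop (illustrative only)."""
    interacting = targets_interacting(positions)
    tracks = {}
    for tid, pos in positions.items():
        if tid not in interacting:
            # Far apart: the laser-based independent tracker is reliable, so
            # its result also supplies a training sample for this target.
            tracks[tid] = laser_track(tid, pos)
            update_classifier(classifiers, tid,
                              extract_sample(tid, tracks[tid]))
        else:
            # Close proximity: fall back on the learned classifier and
            # visual information to keep the targets apart.
            tracks[tid] = visual_track(tid, pos, classifiers.get(tid))
    return tracks
```

The point of the sketch is the switching logic itself: classifiers are trained only while the laser tracker is trusted, and consulted only when proximity makes the laser data ambiguous, which is what keeps the whole process on-line and automatic.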
When the targets do not interact with each other, the laser scanner performs efficient tracking, and visual information can easily be extracted from the camera data. Because these tracking results are reliable, they are used as positive or negative samples to train classifiers for the "possible interacting targets". When the targets are in close proximity, the learned classifiers and visual information will in turn assist in tracking. This mode of co-operation between laser and vision,