Learning a Multiview Part-based Model in Virtual World for Pedestrian Detection

Jiaolong Xu, David Vázquez, Antonio M. López, Javier Marin and Daniel Ponsa
Computer Vision Center, Autonomous University of Barcelona
Edifici O, 08193 Bellaterra, Barcelona, Spain
{jiaolong, dvazquez, antonio, jmarin, daniel}@cvc.uab.es

Abstract— State-of-the-art deformable part-based models based on latent SVM have shown excellent results on human detection. In this paper, we propose to train a multiview deformable part-based model with part examples generated automatically from virtual-world data. The method is efficient because: (i) the part detectors are trained with precisely extracted virtual examples, so no latent learning is needed; (ii) the multiview pedestrian detector enhances the performance of the pedestrian root model; (iii) a top-down approach is used for part detection, which reduces the search space. We evaluate our model on the Daimler and Karlsruhe pedestrian benchmarks with the publicly available Caltech pedestrian detection evaluation framework, and it outperforms the state-of-the-art latent SVM V4.0 in both average miss rate and speed (our detector is ten times faster).

I. INTRODUCTION

Advanced Driver Assistance Systems (ADAS) aim at improving traffic safety by providing warnings and performing counteractive measures in dangerous situations. Reliable image-based pedestrian detection is the major challenge of a pedestrian protection system, a type of ADAS, because pedestrians present high variability in clothes, pose, viewpoint and distance to the camera, all under uncontrolled outdoor illumination [1]. Multiresolution pyramidal detection tries to cope with the variability in distance to the camera, and proper features/descriptors (e.g., HOG [2]) address variability due to clothes and illumination changes, while multiview and part-based models focus on robustness to imaging viewpoint and pose variability, respectively.
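As an illustration of the multiresolution pyramidal detection mentioned above, the following minimal sketch lists the downscaling factors at which a fixed-size detection window still fits inside the image, and the pedestrian heights (in original-image pixels) that each pyramid level can match. The function names, the 128-pixel window height and the scale step are assumptions chosen for illustration only; they are not values taken from this paper.

```python
def pyramid_scales(image_height, window_height=128, scale_step=1.2):
    """Return the downscaling factors of an image pyramid.

    A level is kept while the downscaled image is still at least as
    tall as the detection window. All default values are illustrative
    assumptions, not the settings used in the paper.
    """
    if scale_step <= 1.0:
        raise ValueError("scale_step must be greater than 1")
    scales = []
    scale = 1.0
    while image_height / scale >= window_height:
        scales.append(scale)
        scale *= scale_step
    return scales


def detectable_heights(scales, window_height=128):
    """Pedestrian heights (original-image pixels) matched at each level.

    A window of window_height pixels on a level downscaled by factor s
    covers window_height * s pixels in the original image.
    """
    return [window_height * s for s in scales]


if __name__ == "__main__":
    levels = pyramid_scales(480, window_height=128, scale_step=2.0)
    print(levels)
    print(detectable_heights(levels, window_height=128))
```

Coarser pyramid levels therefore correspond to pedestrians that are closer to the camera, which is how a single fixed-size classifier can cover a range of distances.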
Multiview and part-based models for pedestrian detection are the focus of this paper. Multiview models have proven to be effective for face recognition as well as pedestrian detection. The rationale is that mixing views during training yields more blurred models than training one model per view. Training multiview models requires clustering the object examples according to the considered views. This can be done manually or automatically, though depending on the view granularity and object complexity it is not a trivial procedure. Shape (or contour) has been effective for viewpoint and pose clustering, and is widely used for pedestrian classification [3], [4], detection and tracking [5], and direction estimation [6].

Fig. 1. Virtual-world data: (a) image; (b)-(d) pedestrian examples (top) with their corresponding groundtruth mask (bottom).

The changing pose and articulation of the limbs suggest a classifier based on the integration of local image representations as opposed to a holistic representation, and a sub-region method has been shown to be effective in [7]. Part-based models try to capture the idea that most objects are composed of parts and that these parts can only be in a set of possible relative positions. Some part-based models [8], [9], [10] require manual labelling of object parts in order to perform supervised training. Since obtaining manual labels is a tiresome and error-prone process, large amounts of reliable examples are usually not available for training. This can eventually limit the generality of the learnt models, and thus their performance. However, the most promising method to date for training part-based object models seems to be the structural latent SVM [11], which gives excellent results for pedestrian detection [12] and does not require labelled body parts for training the pedestrian model.
Moreover, latent SVM also treats the viewpoint as latent information, which allows learning more general object models. The price to pay, however, is that this method is complicated to train and sensitive to initialization. Focusing on the problem of multiview and part-based pedestrian detection, we think that one of the key issues is to have reliable labelled examples, i.e., the bounding boxes of the pedestrian examples as well as labelled parts. Obtaining cheap annotated pedestrians has been addressed in our previous work [13], where we generated virtual-world data using a video game. No part annotations were used,