Shape-based Pedestrian Parsing

Yihang Bo
School of Computer and Information Technology
Beijing Jiaotong University, China
yihang.bo@gmail.com

Charless C. Fowlkes
Department of Computer Science
University of California, Irvine
fowlkes@ics.uci.edu

Abstract

We describe a simple model for parsing pedestrians based on shape. Our model assembles candidate parts from an oversegmentation of the image and matches them to a library of exemplars. Our matching uses a hierarchical decomposition into a variable number of parts and computes scores on partial matchings in order to prune the search space of candidate segments. Simple constraints enforce consistent layout of parts. Because our model is shape-based, it generalizes well. We use exemplars from a controlled dataset of poses but achieve good test performance on unconstrained images of pedestrians in street scenes. We demonstrate results of parsing detections returned from a standard scanning-window pedestrian detector and use the resulting parse to perform viewpoint prediction and detection re-scoring.

1. Introduction

A fundamental problem in scene understanding is combining top-down information provided by object detection and recognition with information on object localization provided by bottom-up segmentation. There have been a variety of proposals in the last 10 years for combining segmentation and recognition [32, 16, 29, 2, 15, 17], but perhaps the simplest approach is a feed-forward model in which candidate objects are first detected and then each object is segmented using an object-specific model. To provide a mechanism for feedback, the resulting segmentations can be used to rescore detections (see e.g., [23]) and/or combined to yield a consistent interpretation of the entire scene (see e.g., [31]).

In this paper we focus on the problem of segmenting human figures.
Segmenting humans provides a particularly good testbed for object-specific segmentation since human figures are highly articulated and vary widely in appearance due to clothing. One proposal for bridging the gap between bottom-up segmentation and this high-level task of segmenting a heterogeneous, articulated object is to search over assemblies of small, bottom-up segments (also known as superpixels) in order to find the human figure [20, 28].

Figure 1. Overview of processing. A large pool of candidate segments is generated by directed aggregation of superpixels. Candidate regions are scored based on shape similarity to a database of shape exemplars. Simple constraints between parts enforce consistent layout (e.g., the upper body must appear above the lower body) in the labeling of regions. Assemblies with variable numbers of parts are scored using a simplified hierarchical model of appearance.

The fundamental challenges to be solved in such an approach are dealing with the combinatorial complexity of assembling a large number of potential superpixels and choosing appropriate scoring functions for evaluating the shape and appearance of a given assembly. We tackle both of these issues using a hierarchical description of segment shapes. This allows us to model both constraints between segments (as are captured by standard approaches to recovering articulated pose [9, 24, 11]) as well as correlations in appearance between parts and sub-parts.

Hierarchical composition is an appealing approach and has been explored in several vision contexts, primarily for recognition (see e.g., [13, 14]). Our model is closely related
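To make the overall pipeline concrete, the following is a minimal toy sketch, not the paper's implementation: candidate segments are formed by aggregating adjacent superpixels, each candidate is scored for shape similarity against a small exemplar library (here using intersection-over-union of cell sets as a stand-in for the paper's shape score), and a simple layout constraint (the upper-body segment's centroid must lie above the lower-body segment's) filters inconsistent assemblies. All function names, the IoU score, and the two-part (upper/lower) library are assumptions made for exposition.

```python
from itertools import combinations

def iou(a, b):
    # Stand-in shape similarity: intersection-over-union of two
    # segments, each represented as a set of (row, col) cells.
    return len(a & b) / len(a | b)

def adjacent(a, b):
    # Two superpixels touch if any pair of cells are 4-neighbors.
    return any(abs(r1 - r2) + abs(c1 - c2) == 1
               for r1, c1 in a for r2, c2 in b)

def aggregate(superpixels):
    # Candidate pool: single superpixels plus unions of adjacent pairs
    # (a toy stand-in for the paper's directed aggregation).
    candidates = [set(s) for s in superpixels]
    for i, j in combinations(range(len(superpixels)), 2):
        if adjacent(superpixels[i], superpixels[j]):
            candidates.append(set(superpixels[i]) | set(superpixels[j]))
    return candidates

def centroid_row(seg):
    return sum(r for r, _ in seg) / len(seg)

def best_parse(candidates, exemplars):
    # Score every candidate against every exemplar part, then keep the
    # highest-scoring disjoint (upper, lower) pair whose layout is
    # consistent (upper centroid above lower centroid).
    scored = [(iou(seg, shape), part, seg)
              for seg in candidates
              for part, shape in exemplars.items()]
    best = None
    for s_u, p_u, seg_u in scored:
        for s_l, p_l, seg_l in scored:
            if (p_u == "upper" and p_l == "lower"
                    and not (seg_u & seg_l)
                    and centroid_row(seg_u) < centroid_row(seg_l)):
                if best is None or s_u + s_l > best[0]:
                    best = (s_u + s_l, seg_u, seg_l)
    return best
```

The real system replaces exhaustive pairing with pruning based on partial-match scores in the hierarchy; this sketch only illustrates the aggregate-score-constrain structure of Figure 1.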