HIERARCHICAL PART DETECTION WITH DEEP NEURAL NETWORKS

Esteve Cervantes ⋆†, Long Long Yu †, Andrew D. Bagdanov ⋆, Marc Masana ⋆, Joost van de Weijer ⋆

† Wide Eyes Technologies, Barcelona, Spain
⋆ Computer Vision Center Barcelona, Universitat Autonoma de Barcelona, Spain

ABSTRACT

Part detection is an important aspect of object recognition. Most approaches apply object proposals to generate hundreds of possible part bounding box candidates, which are then evaluated by part classifiers. Recently, several methods have investigated directly regressing to a limited set of bounding boxes from a deep neural network representation. However, for object parts such methods may be infeasible because parts are relatively small with respect to the image. We propose a hierarchical method for object and part detection. In a single network we first detect the object and then regress to part location proposals based only on the feature representation inside the object. Experiments show that our hierarchical approach outperforms a network which directly regresses the part locations. We also show that our approach obtains part detection accuracy comparable to or better than the state of the art on the CUB-200 bird and Fashionista clothing item datasets with only a fraction of the number of part proposals.

Index Terms— Object Recognition, Part Detection, Convolutional Neural Networks

1. INTRODUCTION

Parts are believed to be an essential component of object category models [1, 2]. Methods vary in the way they model spatial relations between parts, the nature of the parts (semantic or unsupervised), and the number of parts. Apart from their use for object detection [2], parts have been applied in action recognition [3] and fine-grained detection [4, 5].

Approaches based on sliding windows have long dominated the field of object recognition.
The ability to implement these methods as a convolutional filter allows them to quickly evaluate many windows; however, the number of windows to consider is vast. As a solution, object proposal methods were developed [6, 7], which use bottom-up image analysis to propose a limited set of object regions. The success of object proposals has sparked their application to part-based object detection [5, 8]. In [5] the selective search object proposal method was used to generate part proposals for bird recognition. However, part detection differs from object detection. In object detection, prior knowledge of the expected location and size of objects is limited, and the generation of thousands of object proposals based on low-level image evidence is reasonable. Parts, in contrast, generally have more restricted statistics, especially when we consider their position with respect to the object location and size. Exploiting these restrictions on the expected position and size of the part proposals is the main objective of this paper.

This work was supported by TIN2014-52072-P and TIN2013-42795-P of the Spanish Ministry of Science. We also thank the NVIDIA Corporation for support in the form of donated GPUs.

Alternatives using regression to directly estimate object proposals from CNN representations have been proposed [9]. This technique, proposed for object detection, is class-agnostic and still requires hundreds of proposals per image. Regressing directly to parts was studied by Liang et al. [10], who directly estimate bounding boxes of clothing items given a person bounding box. Their method has the advantage that only a single bounding box per clothing item class needs to be evaluated. However, their method separates the object detection (in their case the human) from the part detection.

In this paper we propose an end-to-end hierarchical object and part detection framework.
Given a CNN representation of an image, our method regresses a single object bounding box. Next, based on the CNN representation within the object bounding box, we regress a single proposal for each of the parts. We train the hierarchical object and part detection network in an end-to-end fashion. To the best of our knowledge, we are the first to investigate such a hierarchical network for part detection. Our method has the advantage over object proposal methods [6, 7, 5, 8, 9] that we evaluate significantly fewer bounding boxes. In contrast to [10], our work integrates object and part detection in a single network.

2. TOP-DOWN PART REGRESSION

In the recognition problems we consider in this paper, the objects we wish to localize consist of an ensemble of sub-objects, or object parts. In the fashion recognition problems we consider in Section 3, for example, we localize clothing items (e.g. hat, glasses, boots, skirt, and handbag) present in images. These problems are often characterized by a relatively large number of potential parts, some or most of which may not be present in a given image.
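
The two-stage scheme outlined in the introduction (regress one object box from the image representation, then regress one proposal per part from the features inside that box) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the architecture used in this paper: random weights stand in for trained layers, a single linear layer with a sigmoid stands in for each regressor, average pooling stands in for the feature extraction inside the object box, part boxes are predicted in coordinates relative to the object box, and the names pool_inside, to_corners and detect are hypothetical. Part presence is not modelled.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(weights, x):
    """Dense layer without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def to_corners(cx, cy, w, h):
    """Convert a (centre, size) box to (x0, y0, x1, y1), clipped to [0, 1]."""
    x0, y0 = max(0.0, cx - w / 2), max(0.0, cy - h / 2)
    x1, y1 = min(1.0, cx + w / 2), min(1.0, cy + h / 2)
    return x0, y0, x1, y1

def pool_inside(fmap, box):
    """Average the feature vectors of all cells whose centre lies inside box."""
    rows, cols = len(fmap), len(fmap[0])
    x0, y0, x1, y1 = box
    cells = [fmap[r][c] for r in range(rows) for c in range(cols)
             if x0 <= (c + 0.5) / cols <= x1 and y0 <= (r + 0.5) / rows <= y1]
    if not cells:  # degenerate box: fall back to the whole map
        cells = [v for row in fmap for v in row]
    return [sum(ch) / len(cells) for ch in zip(*cells)]

def detect(fmap, w_obj, w_part, num_parts):
    # Stage 1: one object box from the globally pooled features.
    global_feat = pool_inside(fmap, (0.0, 0.0, 1.0, 1.0))
    obj = to_corners(*[sigmoid(z) for z in linear(w_obj, global_feat)])
    # Stage 2: one proposal per part, using only features inside the object box.
    obj_feat = pool_inside(fmap, obj)
    out = [sigmoid(z) for z in linear(w_part, obj_feat)]
    ox, oy, ow, oh = obj[0], obj[1], obj[2] - obj[0], obj[3] - obj[1]
    parts = []
    for p in range(num_parts):
        rx0, ry0, rx1, ry1 = to_corners(*out[4 * p: 4 * p + 4])
        # Part boxes are relative to the object box; map them back to the image.
        parts.append((ox + rx0 * ow, oy + ry0 * oh, ox + rx1 * ow, oy + ry1 * oh))
    return obj, parts

# Toy input: a 6 x 6 feature map with 8 channels and untrained (random) weights.
random.seed(0)
H, W, C, NUM_PARTS = 6, 6, 8, 3
fmap = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
w_obj = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(4)]
w_part = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(4 * NUM_PARTS)]
obj_box, part_boxes = detect(fmap, w_obj, w_part, NUM_PARTS)
print("object box:", obj_box)
print("part proposals:", part_boxes)
```

The sketch evaluates exactly one box for the object and one per part, in contrast to the hundreds or thousands of candidates produced by proposal-based pipelines. Because each part box is expressed relative to the object box, every proposal lies inside the detected object by construction, which is one simple way to encode the restricted position and size statistics of parts discussed above.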