Leveraging Multiple Tasks to Regularize
Fine-Grained Classification
Riddhiman Dasgupta Anoop M. Namboodiri
CVIT, International Institute of Information Technology, Hyderabad, India.
riddhiman.dasgupta@research.iiit.ac.in, anoop@iiit.ac.in
Abstract—Fine-grained classification is an extremely chal-
lenging problem in computer vision, compounded by subtle
differences in shape, pose, illumination and appearance. While
convolutional neural networks have become the versatile jack-
of-all-trades tool in modern computer vision, approaches for
fine-grained recognition still rely on localization of keypoints
and parts to learn discriminative features for recognition. In
order to achieve this, most approaches use a localization module
and subsequently learn classifiers for the inferred locations,
thus necessitating large amounts of manual annotations for
bounding boxes and keypoints. In order to tackle this problem,
we aim to leverage the (taxonomic and/or semantic) relationships
present among fine-grained classes. The ontology tree is a free
source of labels that can be used as auxiliary tasks to train
a multi-task loss. Additional tasks can act as regularizers, and
increase the generalization capabilities of the network. Multiple
tasks try to take the network in diverging directions, and the
network has to reach a common minimum by adapting and
learning features common to all tasks in its shared layers.
We train a multi-task network using auxiliary tasks extracted
from taxonomical or semantic hierarchies, using a novel method
to update task-wise learning rates to ensure that related
tasks aid and unrelated tasks do not hamper performance
on the primary task. Experiments on the popular CUB-200-
2011 dataset show that employing super-classes in an end-to-end
model improves performance, compared to methods employing
additional expensive annotations such as keypoints and bounding
boxes and/or using multi-stage pipelines.
I. INTRODUCTION
Convolutional neural networks (CNNs) first tasted main-
stream success with their impressive performance on large
scale image recognition challenges, starting with Krizhevsky et
al. [1], which brought them into the limelight. Training a
convnet from scratch is usually too expensive and will not
result in the same discriminative power as one trained on
a large dataset like Imagenet. A far more effective strategy is
to fine-tune a convnet pre-trained on Imagenet to new datasets
and/or tasks. Consequently, researchers have adapted convnets
that were pre-trained on Imagenet for a vast plethora of tasks,
ranging from object detection and semantic segmentation to
pose estimation, depth estimation, attribute prediction, part
localization, and many more. The works by Donahue et al. [2],
Razavian et al. [3], Chatfield et al. [4], and Oquab et al. [5]
have shown beyond any reasonable doubt that convnets are
ripe for transfer learning via fine-tuning.
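As a toy illustration of this strategy, a frozen pretrained network can be treated as a fixed feature extractor, with only a new classifier head trained on its outputs. The sketch below stands in for that setup using random vectors in place of real convnet activations; all shapes, names, and hyperparameters are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Stand-ins for descriptors produced by a frozen Imagenet-pretrained backbone.
features = rng.standard_normal((64, 128))   # 64 images, 128-d descriptors
labels = rng.integers(0, 5, size=64)        # 5 hypothetical fine-grained classes

# Only this new classifier head is trained; the "backbone" stays untouched.
W = np.zeros((128, 5))
for _ in range(500):
    grad_logits = softmax(features @ W)
    grad_logits[np.arange(64), labels] -= 1.0   # d(cross-entropy)/d(logits)
    W -= 0.1 * (features.T @ grad_logits) / 64  # gradient step on the head only

train_acc = float((softmax(features @ W).argmax(axis=1) == labels).mean())
```

Freezing the backbone mimics the common first stage of fine-tuning; in practice one would then unfreeze later layers and continue training at a reduced learning rate.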
The primary challenges of fine-grained recognition are
large intra-class variations in pose and illumination, subtle
1 Additional details can be found at cvit.iiit.ac.in/multitaskhierarchy.
[Figure 1 image: a tree linking five kingfisher species (Belted, Green, Pied, Ringed, White Breasted) to the taxa Megaceryle, Ceryle, Chloroceryle, and Halcyon, and to the families Alcedinidae and Halcyonidae.]
Fig. 1. Leveraging the taxonomic ontology of birds for fine-grained recogni-
tion. From top to bottom, we have the species, genus and family for five classes
of kingfishers in the CUB-200-2011 dataset [6]. Observe how identifying the
family or genus can help in identifying the class, e.g., in the case of the ringed
kingfisher and the green kingfisher. Best viewed enlarged, in color.
differences and striking inter-class similarities. Most modern
methods for fine-grained recognition rely on a combination of
localizing discriminative regions and learning corresponding
discriminative features. This in turn requires strong super-
vision such as keypoint or attribute annotations, which are
expensive and difficult to obtain at scale. On the other hand,
since fine-grained recognition deals with subordinate-level
classification, there exist implied relationships among
labels. These relationships may be taxonomical (such as super
classes) or semantic (such as attributes) in nature. The ontol-
ogy obtained in this manner contains rich latent knowledge
about finer differences between classes that can be exploited
for visual classification. The model we propose consists of a
single deep convolutional neural network, with each level of
the ontology giving rise to an additional set of labels for the
input images. These additional labels are used as auxiliary
tasks for a multi-task network, which can be trained end-
to-end using a simple weighted objective function. We also
propose a novel method to dynamically update the learning
rates (henceforth referred to as the task coefficients) for each
task in the multi-task network, based on its relatedness to the
primary task.
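Such an objective can be written as a weighted sum of per-task losses, with the auxiliary weights (the task coefficients) adjusted over training. The snippet below is only a minimal sketch of this idea, assuming each auxiliary task is assigned a relatedness score in [0, 1]; the function names and the simple update form are illustrative and are not the paper's actual rule:

```python
import numpy as np

def multitask_loss(task_losses, task_coeffs):
    # Weighted sum of per-task losses; index 0 is the primary task.
    return float(np.dot(task_coeffs, task_losses))

def update_task_coeffs(task_coeffs, relatedness, step=0.1):
    # Nudge each auxiliary coefficient toward its relatedness score,
    # while the primary task keeps a fixed coefficient of 1.0.
    coeffs = np.asarray(task_coeffs, dtype=float).copy()
    rel = np.asarray(relatedness, dtype=float)
    coeffs[1:] += step * (rel[1:] - coeffs[1:])
    coeffs[0] = 1.0
    return coeffs

# Example: a primary loss plus two auxiliary losses (e.g. super-class levels).
total = multitask_loss([1.2, 0.8, 0.5], [1.0, 0.5, 0.5])      # 1.2 + 0.4 + 0.25
coeffs = update_task_coeffs([1.0, 0.5, 0.5], [1.0, 1.0, 0.0]) # raise task 1, lower task 2
```

A related auxiliary task (relatedness near 1) thus gains influence on the shared layers, while an unrelated one (relatedness near 0) is gradually suppressed.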
In this work, we analyze the utility of jointly learning multiple
related/auxiliary tasks that could regularize each other to
prevent over-fitting, while ensuring that the network retains
its discriminative capability. Much like dropout is bagging
taken to the extreme, multi-task learning is analogous to
boosting, if each task is considered a weak learner. We note
that our model can be plugged into or used in conjunction with
more complex multi-stage pipeline methods such as [7]–[10]
2016 23rd International Conference on Pattern Recognition (ICPR)
Cancún Center, Cancún, México, December 4-8, 2016
978-1-5090-4847-2/16/$31.00 ©2016 IEEE 3476