Leveraging Multiple Tasks to Regularize
Fine-Grained Classification
Riddhiman Dasgupta Anoop M. Namboodiri
CVIT, International Institute of Information Technology, Hyderabad, India.
riddhiman.dasgupta@research.iiit.ac.in, anoop@iiit.ac.in
Abstract—Fine-grained classification is an extremely chal-
lenging problem in computer vision, compounded by subtle
differences in shape, pose, illumination and appearance. While
convolutional neural networks have become the versatile jack-
of-all-trades tool in modern computer vision, approaches for
fine-grained recognition still rely on localization of keypoints
and parts to learn discriminative features for recognition. In
order to achieve this, most approaches use a localization module
and subsequently learn classifiers for the inferred locations,
thus necessitating large amounts of manual annotations for
bounding boxes and keypoints. In order to tackle this problem,
we aim to leverage the (taxonomic and/or semantic) relationships
present among fine-grained classes. The ontology tree is a free
source of labels that can be used as auxiliary tasks to train
a multi-task loss. Additional tasks can act as regularizers, and
increase the generalization capabilities of the network. Multiple
tasks try to take the network in diverging directions, and the
network has to reach a common minimum by adapting and
learning features common to all tasks in its shared layers.
We train a multi-task network using auxiliary tasks extracted
from taxonomical or semantic hierarchies, using a novel method
to update task-wise learning rates to ensure that related
tasks aid and unrelated tasks do not hamper performance
on the primary task. Experiments on the popular CUB-200-
2011 dataset show that employing super-classes in an end-to-end
model improves performance, compared to methods employing
additional expensive annotations such as keypoints and bounding
boxes and/or using multi-stage pipelines.
I. INTRODUCTION
Convolutional neural networks (CNNs) first tasted main-
stream success with their impressive performance on large
scale image recognition challenges, starting with Krizhevsky et
al. [1], which brought them into the limelight. Training a
convnet from scratch is usually too expensive and will not
result in the same discriminative power as one trained on
a large dataset like Imagenet. A far more effective strategy is
to fine-tune a convnet pre-trained on Imagenet to new datasets
and/or tasks. Consequently, researchers have adapted convnets
that were pre-trained on Imagenet for a vast plethora of tasks,
ranging from object detection and semantic segmentation to
pose estimation, depth estimation, attribute prediction, part
localization, and many more. The works by Donahue et al. [2],
Razavian et al. [3], Chatfield et al. [4], and Oquab et al. [5]
have shown beyond any reasonable doubt that convnets are
ripe for transfer learning via fine-tuning.
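As a toy illustration of this strategy, a frozen pretrained network can be treated as a fixed feature extractor, with only a new classifier head trained on its outputs. The sketch below stands in for that setup using random vectors in place of real convnet activations; all shapes, names, and hyperparameters are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Stand-ins for descriptors produced by a frozen Imagenet-pretrained backbone.
features = rng.standard_normal((64, 128))   # 64 images, 128-d descriptors
labels = rng.integers(0, 5, size=64)        # 5 hypothetical fine-grained classes

# Only this new classifier head is trained; the "backbone" stays untouched.
W = np.zeros((128, 5))
for _ in range(500):
    grad_logits = softmax(features @ W)
    grad_logits[np.arange(64), labels] -= 1.0   # d(cross-entropy)/d(logits)
    W -= 0.1 * (features.T @ grad_logits) / 64  # gradient step on the head only

train_acc = float((softmax(features @ W).argmax(axis=1) == labels).mean())
```

Freezing the backbone mimics the common first stage of fine-tuning; in practice one would then unfreeze later layers and continue training at a reduced learning rate.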
The primary challenges of fine-grained recognition are
large intra-class variations in pose and illumination, subtle
1 Additional details can be found at cvit.iiit.ac.in/multitaskhierarchy.
[Figure 1 image: a tree linking five kingfisher species (Belted, Green, Pied, Ringed, White Breasted) to the taxa Megaceryle, Ceryle, Chloroceryle, and Halcyon, and to the families Alcedinidae and Halcyonidae.]
Fig. 1. Leveraging the taxonomic ontology of birds for fine-grained recogni-
tion. From top to bottom, we have the species, genus and family for five classes
of kingfishers in the CUB-200-2011 dataset [6]. Observe how identifying the
family or genus can help in identifying the class, e.g., in the case of the ringed
kingfisher and the green kingfisher. Best viewed enlarged, in color.
differences and striking inter-class similarities. Most modern
methods for fine-grained recognition rely on a combination of
localizing discriminative regions and learning corresponding
discriminative features. This in turn requires strong super-
vision such as keypoint or attribute annotations, which are
expensive and difficult to obtain at scale. On the other hand,
since fine-grained recognition deals with subordinate-level
classification, there exist implied relationships among
labels. These relationships may be taxonomical (such as super
classes) or semantic (such as attributes) in nature. The ontol-
ogy obtained in this manner contains rich latent knowledge
about finer differences between classes that can be exploited
for visual classification. The model we propose consists of a
single deep convolutional neural network, with each level of
the ontology giving rise to an additional set of labels for the
input images. These additional labels are used as auxiliary
tasks for a multi-task network, which can be trained end-
to-end using a simple weighted objective function. We also
propose a novel method to dynamically update the learning
rates (henceforth referred to as the task coefficients) for each
task in the multi-task network, based on its relatedness to the
primary task.
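Such an objective can be written as a weighted sum of per-task losses, with the auxiliary weights (the task coefficients) adjusted over training. The snippet below is only a minimal sketch of this idea, assuming each auxiliary task is assigned a relatedness score in [0, 1]; the function names and the simple update form are illustrative and are not the paper's actual rule:

```python
import numpy as np

def multitask_loss(task_losses, task_coeffs):
    # Weighted sum of per-task losses; index 0 is the primary task.
    return float(np.dot(task_coeffs, task_losses))

def update_task_coeffs(task_coeffs, relatedness, step=0.1):
    # Nudge each auxiliary coefficient toward its relatedness score,
    # while the primary task keeps a fixed coefficient of 1.0.
    coeffs = np.asarray(task_coeffs, dtype=float).copy()
    rel = np.asarray(relatedness, dtype=float)
    coeffs[1:] += step * (rel[1:] - coeffs[1:])
    coeffs[0] = 1.0
    return coeffs

# Example: a primary loss plus two auxiliary losses (e.g. super-class levels).
total = multitask_loss([1.2, 0.8, 0.5], [1.0, 0.5, 0.5])      # 1.2 + 0.4 + 0.25
coeffs = update_task_coeffs([1.0, 0.5, 0.5], [1.0, 1.0, 0.0]) # raise task 1, lower task 2
```

A related auxiliary task (relatedness near 1) thus gains influence on the shared layers, while an unrelated one (relatedness near 0) is gradually suppressed.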
In this work, we analyze the utility of jointly learning multiple
related/auxiliary tasks that could regularize each other to
prevent over-fitting, while ensuring that the network retains
its discriminative capability. Much like dropout is bagging
taken to the extreme, multi-task learning is analogous to
boosting, if each task is considered a weak learner. We note
that our model can be plugged into or used in conjunction with
more complex multi-stage pipeline methods such as [7]–[10]
2016 23rd International Conference on Pattern Recognition (ICPR)
Cancún Center, Cancún, México, December 4-8, 2016
978-1-5090-4847-2/16/$31.00 ©2016 IEEE 3476