arXiv:2103.12827v3 [cs.LG] 8 Sep 2021
Fisher Task Distance and Its Applications in Transfer Learning and Neural
Architecture Search
Cat P. Le, Mohammadreza Soltani, Juncheng Dong, Vahid Tarokh
Duke University
Durham, North Carolina, USA
{cat.le, mohammadreza.soltani, juncheng.dong, vahid.tarokh}@duke.edu
Abstract
We formulate an asymmetric (or non-commutative) distance between tasks based on Fisher Information Matrices. We provide proof of consistency for our distance through theorems and experiments on various classification tasks. We then apply our proposed measure of task distance in transfer learning on visual tasks in the Taskonomy dataset. Additionally, we show how the proposed distance between a target task and a set of baseline tasks can be used to reduce the neural architecture search space for the target task. The complexity reduction in search space for task-specific architectures is achieved by building on the optimized architectures for similar tasks instead of doing a full search without using this side information. Experimental results demonstrate the efficacy of the proposed approach and its improvements over other methods.
Introduction
This paper is motivated by a common assumption made in transfer and lifelong learning: similar tasks usually have similar neural architectures. Building on this intuition, we propose a non-commutative measure, called Fisher Task Distance (FTD), which represents the complexity of transferring the knowledge of one task to another. FTD is defined in terms of the Fisher Information matrix, i.e., the second derivative of the loss function with respect to the parameters of the models under consideration. By definition, FTD is always greater than or equal to zero, with equality if and only if the distance is measured from a task to itself. To show that our task distance is mathematically well-defined, we provide theoretical analysis. Moreover, we empirically verify that the FTD is a statistically consistent distance through experiments on numerous tasks and datasets.
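To make the quantities involved concrete, the following sketch computes a diagonal empirical Fisher from per-sample gradients and a Fréchet-style distance between two such matrices. The function names, the diagonal approximation, and the 1/√2 normalization are illustrative assumptions rather than the paper's exact estimator; note also that the asymmetry of FTD arises from how the Fisher matrices are estimated for a given source-target direction, not from the norm below, which is symmetric in its arguments.

```python
# Illustrative sketch only: a diagonal empirical Fisher and a
# Frechet-style distance between tasks. Names and normalization are
# assumptions; the paper's estimator may differ in detail.
import numpy as np

def fisher_diag(grads):
    """Diagonal empirical Fisher: mean of squared per-sample gradients.

    `grads` is an (n_samples, n_params) array of per-sample gradients
    of the loss w.r.t. a shared set of model parameters.
    """
    return np.mean(grads ** 2, axis=0)

def fisher_task_distance(fisher_a, fisher_b):
    """Frechet-style distance between two diagonal Fisher matrices.

    Non-negative, and zero exactly when the two diagonal Fishers
    coincide, matching the d(a, a) = 0 property stated in the text.
    """
    return np.linalg.norm(np.sqrt(fisher_a) - np.sqrt(fisher_b)) / np.sqrt(2)

rng = np.random.default_rng(0)
g_a = rng.normal(size=(128, 10))        # per-sample grads on task a
g_b = rng.normal(size=(128, 10)) * 2.0  # task b: different curvature scale

f_a, f_b = fisher_diag(g_a), fisher_diag(g_b)
print(fisher_task_distance(f_a, f_a))   # 0.0: distance from a task to itself
print(fisher_task_distance(f_a, f_b))   # > 0 for distinct tasks
```

In practice the gradients would come from a trained network evaluated on each task's data; the random arrays here only stand in for those per-sample gradients.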
Next, we instantiate our proposed task distance in two important AI applications, Transfer Learning (TL) and Neural Architecture Search (NAS). In particular, we demonstrate how the FTD identifies related tasks and utilizes the gained knowledge for transfer learning. Experiments on the visual tasks in the Taskonomy dataset (Zamir et al. 2018) demonstrate the computational efficiency of our approach while achieving results similar to those of the brute-force approach proposed by Zamir et al. (2018).
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Next, we apply the proposed task distance in the NAS framework, learning an appropriate architecture for a target task based on its similarity to other learned tasks.
For a target task, the closest task in a given set of baseline tasks is identified, and its corresponding architecture is used to construct a neural search space for the target task without requiring prior domain knowledge. Subsequently, a gradient-based search algorithm called FUSE (Le et al. 2021) is applied to discover an appropriate architecture for the target task. In brief, utilizing the architectures of related tasks reduces the dependency on prior domain knowledge, consequently shortening the search time for the final architecture and improving the robustness of the search algorithm. Extensive experimental results for classification tasks on the MNIST (LeCun, Cortes, and Burges 2010), CIFAR-10, CIFAR-100 (Krizhevsky, Hinton et al. 2009), and ImageNet (Russakovsky et al. 2015) datasets demonstrate the efficacy and superiority of our proposed approach compared to state-of-the-art approaches.
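The seeding step described above — compute the FTD from the target task to each baseline, pick the closest baseline, and build a narrow search space around its optimized architecture — can be sketched as follows. The dictionaries, architecture strings, and mutation scheme are hypothetical placeholders; the actual FUSE search and cell encodings are more involved.

```python
# Hypothetical sketch of search-space seeding via task distance.
# `distances` maps baseline task -> FTD to the target task;
# `baseline_archs` maps baseline task -> its optimized architecture.
# Both structures and the string "architectures" are placeholders.

def closest_baseline(distances):
    """Return the baseline task with the smallest FTD to the target."""
    return min(distances, key=distances.get)

def seed_search_space(baseline_archs, distances):
    """Build a small candidate pool around the closest task's architecture."""
    task = closest_baseline(distances)
    seed = baseline_archs[task]
    # A narrow search space: the seed plus a few mutations of it,
    # instead of a full unconstrained architecture search.
    return [seed] + [f"{seed}+mutation{i}" for i in range(3)]

distances = {"mnist": 0.41, "cifar10": 0.12, "imagenet": 0.63}
baseline_archs = {"mnist": "archA", "cifar10": "archB", "imagenet": "archC"}

print(closest_baseline(distances))                 # cifar10
print(seed_search_space(baseline_archs, distances))
```

A gradient-based search such as FUSE would then be run only over this reduced candidate pool, which is where the reported search-time savings come from.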
Related Work
Task similarity has mainly been considered in the transfer learning (TL) literature. Similar tasks are expected to have similar architectures, as manifested by the success of applying transfer learning in many applications (Silver and Bennett 2008; Finn et al. 2016; Mihalkova, Huynh, and Mooney 2007; Niculescu-Mizil and Caruana 2007; Luo et al. 2017; Razavian et al. 2014; Pan and Yang 2010). However, the main goal in TL is to transfer trained weights from a related task to a target task. Recently, a measure of closeness of tasks based on the Fisher Information matrix has been used as a regularization technique in transfer learning (Chen, Zhang, and Dong 2018) and continual learning (Kirkpatrick et al. 2017) to prevent catastrophic forgetting. Additionally, task similarity has also been investigated between visual tasks in (Zamir et al. 2018; Pal and Balasubramanian 2019; Dwivedi and Roig 2019; Achille et al. 2019; Wang, Wehbe, and Tarr 2019; Standley et al. 2020). These works focus only on weight transfer and do not utilize task similarities for discovering high-performing architectures. Moreover, the measures of similarity introduced in these works are often assumed to be symmetric, which is not typically a realistic assumption. For instance, consider learning a binary classification between cat and dog images in the CIFAR-10 dataset. It is