arXiv:2103.12827v3 [cs.LG] 8 Sep 2021

Fisher Task Distance and Its Applications in Transfer Learning and Neural Architecture Search

Cat P. Le, Mohammadreza Soltani, Juncheng Dong, Vahid Tarokh
Duke University, Durham, North Carolina, USA
{cat.le, mohammadreza.soltani, juncheng.dong, vahid.tarokh}@duke.edu

Abstract

We formulate an asymmetric (or non-commutative) distance between tasks based on Fisher Information Matrices. We provide proof of consistency for our distance through theorems and experiments on various classification tasks. We then apply our proposed measure of task distance to transfer learning on visual tasks in the Taskonomy dataset. Additionally, we show how the proposed distance between a target task and a set of baseline tasks can be used to reduce the neural architecture search space for the target task. This reduction in the search space for task-specific architectures is achieved by building on the optimized architectures for similar tasks instead of performing a full search without this side information. Experimental results demonstrate the efficacy of the proposed approach and its improvements over other methods.

Introduction

This paper is motivated by a common assumption made in transfer and lifelong learning: similar tasks usually have similar neural architectures. Building on this intuition, we propose a non-commutative measure, called the Fisher Task Distance (FTD), which represents the complexity of transferring the knowledge of one task to another. The FTD is defined in terms of the Fisher Information matrix, defined as the second derivative of the loss function with respect to the parameters of the models under consideration. By definition, the FTD is always greater than or equal to zero, with equality if and only if it is the distance from a task to itself. To show that our task distance is mathematically well-defined, we provide some theoretical analysis.
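As a loose illustration of the idea (not the paper's exact formula), the following sketch computes an asymmetric Fisher-based distance between two tasks for a fixed logistic-regression model, using a diagonal empirical Fisher approximation. All function names here are ours; the distance is asymmetric because the model parameters are fitted to the source task only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def diagonal_fisher(w, X, y):
    """Diagonal of the empirical Fisher for logistic regression:
    the average of squared per-example log-likelihood gradients."""
    p = sigmoid(X @ w)
    grads = (p - y)[:, None] * X        # per-example gradients, shape (n, d)
    return np.mean(grads ** 2, axis=0)  # Fisher diagonal, shape (d,)

def fisher_task_distance(w_a, X_a, y_a, X_b, y_b):
    """Illustrative asymmetric distance from task a to task b: the Fisher
    of a model for task a is evaluated on task a's data and on task b's
    data, and the two diagonal matrices are compared via a Frechet-style
    distance between their square roots."""
    F_a  = diagonal_fisher(w_a, X_a, y_a)
    F_ab = diagonal_fisher(w_a, X_b, y_b)
    return np.linalg.norm(np.sqrt(F_a) - np.sqrt(F_ab)) / np.sqrt(2.0)
```

Note that the self-distance is exactly zero, since the two Fisher matrices coincide when both are computed on the source task's own data, matching the non-negativity property stated above.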
Moreover, we empirically verify that the FTD is a statistically consistent distance through experiments on numerous tasks and datasets. Next, we instantiate our proposed task distance on two important AI applications: Transfer Learning (TL) and Neural Architecture Search (NAS). In particular, we demonstrate how the FTD identifies related tasks and utilizes the gained knowledge for transfer learning. The experiments on the visual tasks in the Taskonomy dataset (Zamir et al. 2018) indicate our computational efficiency while achieving results similar to those of the brute-force approach proposed by Zamir et al. (2018). Next, we apply the proposed task distance in the NAS framework, learning an appropriate architecture for a target task based on its similarity to other learned tasks. For a target task, the closest task in a given set of baseline tasks is identified, and its corresponding architecture is used to construct a neural search space for the target task without requiring prior domain knowledge. Subsequently, a gradient-based search algorithm called FUSE (Le et al. 2021) is applied to discover an appropriate architecture for the target task. Briefly, the utilization of the related tasks' architectures reduces the dependency on prior domain knowledge, consequently reducing the search time for the final architecture and improving the robustness of the search algorithm. Extensive experimental results for classification tasks on the MNIST (LeCun, Cortes, and Burges 2010), CIFAR-10, CIFAR-100 (Krizhevsky, Hinton et al. 2009), and ImageNet (Russakovsky et al. 2015) datasets demonstrate the efficacy and superiority of our proposed approach compared to state-of-the-art approaches.

Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Related Works

Task similarity has mainly been considered in the transfer learning (TL) literature.
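The search-space construction described above — identifying the baseline task with the smallest FTD to the target and reusing its architecture to seed the search — can be sketched as follows. The function and task names are illustrative only; the actual FUSE search procedure is described in Le et al. (2021).

```python
def select_seed_architecture(ftd_to_target, task_architectures):
    """Choose a search-space seed for a target task: return the baseline
    task with the smallest Fisher task distance to the target, together
    with that task's learned architecture. The seed architecture would
    then define the initial search space for a NAS algorithm such as FUSE."""
    closest = min(ftd_to_target, key=ftd_to_target.get)
    return closest, task_architectures[closest]

# Hypothetical distances from three baseline tasks to a target task.
distances = {"mnist": 0.8, "cifar10": 0.3, "imagenet": 0.5}
architectures = {"mnist": "arch_A", "cifar10": "arch_B", "imagenet": "arch_C"}
seed_task, seed_arch = select_seed_architecture(distances, architectures)
```

Because the seed comes from the most similar learned task rather than a hand-designed prior, the subsequent search starts from a smaller, better-targeted space.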
Similar tasks are expected to have similar architectures, as manifested by the success of applying transfer learning in many applications (Silver and Bennett 2008; Finn et al. 2016; Mihalkova, Huynh, and Mooney 2007; Niculescu-Mizil and Caruana 2007; Luo et al. 2017; Razavian et al. 2014; Pan and Yang 2010). However, the main goal in TL is to transfer trained weights from a related task to a target task. Recently, a measure of closeness of tasks based on the Fisher Information matrix has been used as a regularization technique in transfer learning (Chen, Zhang, and Dong 2018) and continual learning (Kirkpatrick et al. 2017) to prevent catastrophic forgetting. Additionally, task similarity has also been investigated between visual tasks in (Zamir et al. 2018; Pal and Balasubramanian 2019; Dwivedi and Roig 2019; Achille et al. 2019; Wang, Wehbe, and Tarr 2019; Standley et al. 2020). These works focus only on weight transfer and do not utilize task similarities for discovering high-performing architectures. Moreover, the measures of similarity introduced in these works are often assumed to be symmetric, which is not typically a realistic assumption. For instance, consider learning a binary classification between cat and dog images in the CIFAR-10 dataset. It is