On the Unreasonable Effectiveness of Knowledge Distillation: Analysis in the Kernel Regime (Long Version)

A PREPRINT

Arman Rahbar
Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
armanr@chalmers.se

Ashkan Panahi
Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
ashkan.panahi@chalmers.se

Chiranjib Bhattacharyya
Department of Computer Science and Automation, Indian Institute of Science, Karnataka, India
chiru@iisc.ac.in

Devdatt Dubhashi
Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
dubhashi@chalmers.se

Morteza Haghir Chehreghani
Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
morteza.chehreghani@chalmers.se

March 31, 2020

ABSTRACT

Knowledge distillation (KD), i.e. training one classifier on the outputs of another, is an empirically very successful technique for transferring knowledge between classifiers. It has even been observed that classifiers learn much faster and more reliably when trained on the outputs of another classifier as soft labels, rather than on ground-truth labels. However, there has been little or no theoretical analysis of this phenomenon. We provide the first theoretical analysis of KD in the setting of extremely wide two-layer non-linear networks, in the model and regime of [1, 2, 3]. We prove results on what the student network learns and on its rate of convergence. Intriguingly, we also confirm the lottery ticket hypothesis [4] in this model. To prove our results, we extend the repertoire of techniques from linear system dynamics. We give a corresponding experimental analysis that validates the theoretical results and yields additional insights.

1 Introduction

In 2014, Hinton et al. [5] made a surprising observation: they found it easier to train classifiers using the real-valued outputs of another classifier as target values than using the actual ground-truth labels. They introduced the term knowledge distillation (or distillation for short) for this phenomenon. Since then, distillation-based training has been confirmed robustly in several different types of neural networks [6, 7, 8]. It has been observed that optimization is generally better behaved under distillation than under label-based training, and that it requires little, if any, regularization or specific optimization tricks. Consequently, in several fields, distillation has become a standard technique for transferring information between classifiers with different architectures, such as from deep to shallow neural networks, or from ensembles of classifiers to individual ones.
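To make the training scheme concrete, the following is a minimal sketch of distillation with temperature-softened soft labels in the style of Hinton et al. [5], written in PyTorch; the teacher and student models, the temperature value, and the data loader are illustrative placeholders, not part of this paper's setup.

```python
# Minimal sketch of the distillation objective of Hinton et al. [5].
# `teacher`, `student`, `loader`, `optimizer`, and T are hypothetical.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * T**2

# The student is trained on the teacher's soft labels; the ground-truth
# labels from the loader are never used.
# for x, _ in loader:
#     with torch.no_grad():
#         t_logits = teacher(x)
#     loss = distillation_loss(student(x), t_logits)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```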