On the Unreasonable Effectiveness of Knowledge Distillation: Analysis in the Kernel Regime (Long Version)

A PREPRINT

Arman Rahbar
Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
armanr@chalmers.se

Ashkan Panahi
Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
ashkan.panahi@chalmers.se

Chiranjib Bhattacharyya
Department of Computer Science and Automation, Indian Institute of Science, Karnataka, India
chiru@iisc.ac.in

Devdatt Dubhashi
Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
dubhashi@chalmers.se

Morteza Haghir Chehreghani
Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
morteza.chehreghani@chalmers.se

March 31, 2020

ABSTRACT

Knowledge distillation (KD), i.e. training one classifier on the outputs of another, is an empirically very successful technique for transferring knowledge between classifiers. It has even been observed that classifiers learn much faster and more reliably when trained on the outputs of another classifier as soft labels, rather than on ground-truth labels. However, there has been little or no theoretical analysis of this phenomenon. We provide the first theoretical analysis of KD in the setting of extremely wide two-layer non-linear networks, in the model and regime of [1, 2, 3]. We prove results on what the student network learns and on its rate of convergence. Intriguingly, we also confirm the lottery ticket hypothesis [4] in this model. To prove our results, we extend the repertoire of techniques from linear system dynamics. We give a corresponding experimental analysis that validates the theoretical results and yields additional insights.

1 Introduction

In 2014, Hinton et al. [5] made a surprising observation: they found it easier to train classifiers using the real-valued outputs of another classifier as target values than using the actual ground-truth labels. They introduced the term knowledge distillation (or distillation for short) for this phenomenon. Since then, distillation-based training has been confirmed robustly in several different types of neural networks [6, 7, 8]. It has been observed that optimization is generally better behaved under distillation than under label-based training, and that it requires little, if any, regularization or specific optimization tricks. Consequently, in several fields, distillation has become a standard technique for transferring information between classifiers with different architectures, such as from deep to shallow neural networks, or from ensembles of classifiers to individual ones.
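To make the training scheme concrete, the following is a minimal sketch of distillation with temperature-softened soft labels in the style of Hinton et al. [5], written in PyTorch; the teacher and student models, the temperature value, and the data loader are illustrative placeholders, not part of this paper's setup.

```python
# Minimal sketch of the distillation objective of Hinton et al. [5].
# `teacher`, `student`, `loader`, `optimizer`, and T are hypothetical.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * T**2

# The student is trained on the teacher's soft labels; the ground-truth
# labels from the loader are never used.
# for x, _ in loader:
#     with torch.no_grad():
#         t_logits = teacher(x)
#     loss = distillation_loss(student(x), t_logits)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```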