What training reveals about neural network complexity

Andreas Loukas, EPFL, andreas.loukas@epfl.ch
Marinos Poiitis, Aristotle University of Thessaloniki, mpoiitis@csd.auth.gr
Stefanie Jegelka, MIT, stefje@mit.edu

Abstract

This work explores the Benevolent Training Hypothesis (BTH), which argues that the complexity of the function a deep neural network (NN) is learning can be deduced from its training dynamics. Our analysis provides evidence for BTH by relating the NN's Lipschitz constant in different regions of the input space to the behavior of the stochastic training procedure. We first observe that the Lipschitz constant close to the training data affects various aspects of the parameter trajectory, with more complex networks having a longer trajectory, bigger variance, and often veering further from their initialization. We then show that NNs whose first-layer bias is trained more steadily (i.e., slowly and with little variation) have bounded complexity even in regions of the input space that are far from any training point. Finally, we find that steady training with Dropout implies a training- and data-dependent generalization bound that grows poly-logarithmically with the number of parameters. Overall, our results support the intuition that good training behavior can be a useful bias towards good generalization.

1 Introduction

Though neural networks (NNs) trained on relatively small datasets can generalize well, when employing them on unfamiliar tasks significant trial and error may be needed to select an architecture that does not overfit [1]. Could it be that NN designers favor architectures that can be easily trained, and that this biases them towards models with better generalization? At the heart of this question lies what we refer to as the "Benevolent Training Hypothesis" (BTH), which argues that the behavior of the training procedure can be used as an indicator of the complexity of the function a NN is learning.
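The trajectory statistics mentioned above (path length, step variance, distance from initialization) are straightforward to log during training. Below is a minimal, hypothetical sketch on a toy linear model trained with SGD; all names and the setup are illustrative stand-ins, not the paper's experimental protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): linear regression trained with per-sample SGD,
# logging the parameter trajectory to compute the statistics discussed above.
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)

w = np.zeros(5)
w_init = w.copy()
trajectory = [w.copy()]
lr = 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):
        grad = (X[i] @ w - y[i]) * X[i]  # per-sample squared-loss gradient
        w -= lr * grad
        trajectory.append(w.copy())

traj = np.asarray(trajectory)
steps = np.linalg.norm(np.diff(traj, axis=0), axis=1)
trajectory_length = steps.sum()              # total path length in parameter space
dist_from_init = np.linalg.norm(w - w_init)  # net displacement from initialization
step_variance = steps.var()                  # variability of individual update sizes
```

By the triangle inequality, the path length always upper-bounds the displacement from initialization; the gap between the two is one crude indicator of how much the optimizer meanders.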
Some empirical evidence for BTH already exists: (a) It has been observed that training becomes more tedious for high-frequency directions in the input space [2] and that low frequencies are learned first [3]. (b) Training also slows down as more images/labels are corrupted [4]; e.g., the Inception [5] architecture is 3.5× slower to train when used to predict random labels than real ones. (c) Finally, Arpit et al. [6] noticed that the loss is more sensitive with respect to specific training points when the network is memorizing data, and that training slows down faster as the NN size decreases when the data contain noise. From the theory side, it is known that the training of shallow networks converges faster for more separable classes [7] and slower when fitting random labels [8]. In addition, the stability [9] of stochastic gradient descent (SGD) implies that (under assumptions) NNs that can be trained with a small number of iterations provably generalize [10, 11]. Intuitively, since each gradient update conveys limited information, a NN that sees each training point only a few times (typically one or two) will not learn enough about the training set to overfit. Despite the elegance of this claim, the explanation does not necessarily account for what is observed in practice, where NNs trained for thousands of epochs can generalize even without rapidly decaying learning rates.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
arXiv:2106.04186v2 [cs.LG] 29 Oct 2021
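The complexity notion underlying BTH is the Lipschitz constant at different regions of the input space. A minimal sketch of how one might probe it empirically, assuming a toy two-layer ReLU network with hypothetical random weights (not the paper's estimator): random directional probes give a lower bound on the local Lipschitz constant, while the product of layer spectral norms gives a global upper bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer ReLU network; the weights are random stand-ins.
W1 = rng.normal(size=(16, 2))
b1 = rng.normal(size=16)
W2 = rng.normal(size=(1, 16))

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

def local_lipschitz_lower_bound(x, radius=0.1, n_probes=200):
    """Crude lower bound on the Lipschitz constant near x via random probes."""
    best = 0.0
    for _ in range(n_probes):
        d = rng.normal(size=x.shape)
        d = radius * d / np.linalg.norm(d)
        best = max(best, np.linalg.norm(f(x + d) - f(x)) / np.linalg.norm(d))
    return best

x0 = np.array([0.5, -0.3])
L_local = local_lipschitz_lower_bound(x0)
# Global upper bound from layer spectral norms (ReLU is 1-Lipschitz):
L_global = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)
```

The gap between `L_local` at points near the training data and the worst-case bound `L_global` is exactly the kind of region-dependent complexity that BTH ties to training behavior.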