Using Winning Lottery Tickets in Transfer Learning for Convolutional Neural Networks

Ryan Van Soelen
Department of Computer Science
Johns Hopkins University
Baltimore, MD, USA
rvansoe2@jhu.edu

John W. Sheppard
Gianforte School of Computing
Montana State University
Bozeman, MT, USA
john.sheppard@montana.edu

Abstract—Neural network pruning can be an effective method for creating more efficient networks without incurring a significant penalty in accuracy. It has been shown that the topology induced by pruning after training can be used to re-train a network from scratch on the same data set, with comparable or better performance. In the context of convolutional neural networks, we build on this work to show that not only can networks be pruned to 10% of their original parameters, but that these sparse networks can also be re-trained on similar data sets with only a slight reduction in accuracy. We use the Lottery Ticket Hypothesis as the basis for our pruning method and discuss how this method can be an alternative to transfer learning, with positive initial results. This paper lays the groundwork for a transfer learning method that reduces the original network to its essential connections and does not require freezing entire layers.

I. INTRODUCTION

When applied to complex problems such as image recognition, neural network architectures tend to be very large, requiring large amounts of computational resources for training, inference, and storage. The large number of parameters also requires a sufficiently large data set to properly train the network. This data and hardware burden constrains the applicability of methods based on deep learning. To remedy this issue, many researchers have turned to transfer learning, in which the learned features of a pre-trained network are applied to a new task. A common approach to transfer learning involves freezing the weights of the lower layers while retraining the higher layers on the new data.
At most, the lower layers are only adjusted through a process of fine-tuning. This work considers an alternative approach in which a large network is distilled to a smaller size, such that it can be retrained from scratch on a new but related problem using the original network's initial parameters.

The basis of this approach stems from the Lottery Ticket Hypothesis, introduced by Frankle and Carbin [2]. The hypothesis argues that the strength of a neural network stems from only a subset of its connections. This sub-network, called the winning ticket, was by chance initialized in just the right way to allow for good convergence on the data. The authors demonstrate a way of pruning networks down to these sparse winning tickets, which can be retrained to achieve better performance than the original network.

We propose adapting this lottery ticket-based approach to transfer learning. Rather than transferring the lower-level features of a network, the winning ticket sub-network is extracted and then retrained on the new data. This allows the important connections to be transferred from the original network, while still allowing all layers to adapt to the new data set. We observe that the winning ticket can be pruned to as little as 10% of its initial parameters, and that retraining these small networks can achieve performance comparable to the original architecture trained on the new data set, depending on the severity of the pruning.

Beyond improved performance of the trained network, there are many other benefits to using this ticket-transfer approach. Assuming the appropriate hardware and software implementations are used, pruned networks are more efficient both in storage and in inference time. This makes the networks more applicable to hardware-limited devices such as mobile platforms or embedded systems.

II. RELATED WORK

A. The Lottery Ticket Hypothesis

As previously stated, the foundation of this work is based on the Lottery Ticket Hypothesis [2].
Frankle and Carbin show that randomly initialized dense neural networks can contain sub-networks, called winning tickets, which when trained in isolation achieve comparable or better test accuracy in a comparable number of iterations. When trained on image data, winning tickets were found in fully-connected networks, convolutional networks, Visual Geometry Group (VGG)-style networks, and residual networks (ResNets). However, in the case of VGG-style networks and ResNets, the discovery of a winning ticket was conditioned on the training hyperparameters, meaning that some networks may not have winning tickets.

The standard approach for finding the winning ticket is as follows [2]:

1) Randomly initialize a neural network with parameters θ_0.
2) Train the network for k iterations, resulting in parameters θ_k.
3) Prune s% of the network by masking the lowest-magnitude parameters to 0.
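The steps above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the 4×4 weight matrix, the simulated "training" perturbation, and the 50% pruning rate are arbitrary choices for the example. The final line reflects the paper's observation that the surviving connections are reset to their original initialization θ_0 before retraining.

```python
import numpy as np

def magnitude_prune_mask(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Binary mask that zeroes the lowest-magnitude `sparsity` fraction of weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)                 # number of weights to prune
    if k == 0:
        return np.ones_like(weights)
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    return (np.abs(weights) > threshold).astype(weights.dtype)

rng = np.random.default_rng(0)

# Step 1: random initialization theta_0
theta_0 = rng.standard_normal((4, 4))

# Step 2: training for k iterations would produce theta_k;
# a small perturbation stands in for training here.
theta_k = theta_0 + 0.1 * rng.standard_normal((4, 4))

# Step 3: prune s% = 50% of the weights by magnitude of theta_k
mask = magnitude_prune_mask(theta_k, 0.5)

# Winning ticket: surviving connections reset to their ORIGINAL
# initialization theta_0, ready to be retrained.
winning_ticket = mask * theta_0
```

In the ticket-transfer setting described above, the retraining of `winning_ticket` would then be carried out on the new, related data set rather than the original one.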