Adaptive Precision Training (ADEPT): A dynamic fixed point quantized sparsifying training approach for DNNs

Lorenz Kummer [1,2], Kevin Sidak [1,3], Tabea Reichmann [1,4], Wilfried Gansterer [1,5]

August 2021

Abstract

Quantization is a technique for reducing the training and inference times of deep neural networks (DNNs), which is crucial for training in resource-constrained environments or for time-critical inference applications. State-of-the-art (SOTA) approaches focus on post-training quantization, i.e., quantization of pre-trained DNNs for speeding up inference. Little work exists on quantized training, and existing approaches usually require full-precision refinement afterwards or enforce a global word length across the whole DNN, which leads to suboptimal bitwidth-to-layer assignments and resource usage. Recognizing these limits, we introduce ADEPT, a new quantized sparsifying training strategy that uses information theory-based intra-epoch precision switching to find, on a per-layer basis, the lowest precision that causes no quantization-induced information loss, while keeping precision high enough that future learning steps do not suffer from vanishing gradients, producing a fully quantized DNN. Based on a bitwidth-weighted MAdds performance model, our approach achieves an average speedup of 1.26 and a model size reduction of 0.53 compared to standard training in float32, with an average accuracy increase of 0.98% on AlexNet/ResNet on CIFAR10/100.

[1] Faculty of Computer Science, University of Vienna
[2] lorenz.kummer@univie.ac.at
[3] kevin.sidak@univie.ac.at
[4] tabea.reichmann@univie.ac.at
[5] wilfried.gansterer@univie.ac.at

1 Introduction

With the general trend in machine learning leaning towards larger model sizes to solve increasingly complex problems, these models can be difficult to handle in the context of inference in time-critical applications or of training under resource and/or productivity constraints.
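The quantization format named in the title is dynamic fixed point: all values in a tensor share one binary point, and the position of that point is chosen dynamically from the tensor's current value range. As a rough illustration only (the exponent selection and rounding below are our assumptions, not the paper's exact implementation), a tensor can be quantized to a given wordlength like this:

```python
import numpy as np

def quantize_dynamic_fixed_point(x, wordlength):
    """Quantize a tensor to `wordlength`-bit fixed point with a shared,
    dynamically chosen exponent (all values share one binary point)."""
    max_abs = np.max(np.abs(x))
    if max_abs == 0.0:
        return np.zeros_like(x)
    # integer bits needed for the largest magnitude; may be negative,
    # which shifts the whole word into the fractional part
    int_bits = int(np.ceil(np.log2(max_abs))) + 1
    frac_bits = wordlength - 1 - int_bits    # one bit reserved for the sign
    scale = 2.0 ** frac_bits
    qmax = 2 ** (wordlength - 1) - 1
    q = np.clip(np.round(x * scale), -qmax - 1, qmax)
    return q / scale                         # dequantized view of the stored integers
```

The "dynamic" part is that the shared exponent is recomputed from the data rather than fixed in advance, so the same wordlength adapts to tensors of very different magnitudes.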
Applications in which a more time- and space-efficient model is crucial include robotics, augmented reality, self-driving vehicles, mobile applications, applications running on consumer hardware, and research that needs a large number of trained models for hyperparameter optimization. Moreover, even some of the most common DNN architectures, such as AlexNet [1] or ResNet [2], suffer from over-parameterization and overfitting.

Possible solutions to the aforementioned problems include pruning [3, 4, 5, 6, 7, 8] and quantization. When quantizing, the bitwidth of the parameters is decreased; the resulting lower precision enables more efficient use of computing resources, such as memory and runtime. However, quantization has to be performed with a certain caution, as naive approaches or a too low bitwidth can have a negative impact on the accuracy of the network, which is unacceptable for most use cases. For example, binary quantization [9] can effectively speed up computation, as multiplications can be performed as bit shifts, and it further reduces memory, but accuracy suffers heavily under this approach. Prior approaches mainly quantize the network for inference, or use a global bitwidth, which does not take

arXiv:2107.13490v3 [cs.LG] 13 Aug 2021
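The binary quantization mentioned above collapses each weight to a single sign bit, so real-valued multiplications degenerate into sign-conditional additions. A minimal NumPy sketch (the per-tensor scaling by the mean absolute weight is our assumption for illustration, not necessarily the exact scheme of [9]):

```python
import numpy as np

def binarize(w):
    """Collapse real-valued weights to signs plus a single per-tensor scale."""
    alpha = np.mean(np.abs(w))               # one full-precision scale per tensor
    return np.sign(w).astype(np.int8), alpha

def binary_matvec(signs, alpha, x):
    """Matrix-vector product without real multiplications: every term is
    just +x_j or -x_j; the only multiplication, by alpha, happens once."""
    return alpha * (signs.astype(x.dtype) @ x)
```

For w = [[0.5, -1.0], [2.0, -0.5]] and x = [1, 2], the binarized product is [-1, -1] while the exact product w @ x is [-1.5, 1.0], which illustrates why accuracy can suffer heavily under so aggressive a scheme.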