Optimization of GPU and CPU Acceleration for Neural Network Layers Implemented in Python

Radu Dogaru, Ioana Dogaru
Natural Computing Laboratory, Dept. of Applied Electronics and Information Eng.
University "Politehnica" of Bucharest, Romania
radu_d@ieee.org, ioana.dogaru@upb.ro

Abstract—Many neural architectures, including RBF, SVM, and FSVC classifiers as well as deep-learning solutions, require the efficient implementation of neuron layers, each having a given number m of neurons, a specific set of parameters, and operating on a training or test set of N feature vectors, each of dimension n. Herein we investigate how to allocate the computation to GPU kernels and how to best optimize the problem parameters (neural structure and training-set size) as well as the GPU parameters in order to maximize the acceleration relative to a CPU implementation. It is shown that by maximizing the load (the number of threads on each computational GPU core) and by a proper allocation of the GPU global memory, very large speedups (100-250 times) with respect to the CPU implementation can be achieved while using the convenient NUMBA Python package, which supports CUDA programming of the GPU. Consequently, it is shown that, given a problem posed to a neural network, a convenient decomposition of the network can be carried out so that the parts of the computation are allocated optimally to the GPU in order to maximize efficiency. Also, for CPU implementations it was found that Intel's MKL library (called from the NUMPY package) can offer efficient implementations of neural layers, comparable to what is achieved using a GPU.

Keywords — neural networks; high performance computing; graphical processing unit (GPU); radial basis functions.

I. INTRODUCTION

Numerous practical problems arise where big data (images, complex signals, time series, etc.) have to be automatically classified using neural networks or other similar constructs.
Since computations in such networks involve billions of basic arithmetic operations, it is worth investigating how to make this computation efficient, particularly since relatively cheap GPUs are readily available on most current computational platforms. Numerous authors have investigated this aspect in recent years, proposing various solutions [1][2]. Usually, speedups (over CPU implementations) of up to one order of magnitude are reported for commonly available GPUs [1][2]. There are many possibilities to map the computations associated with neural layers onto a GPU, for instance by allowing threads (distributed across the multiple GPU cores) to execute synaptic multiplications, as discussed in [3], where the need for a "map-reduce" unit to collect results from all parallel threads was identified as a limiting factor, leading to the proposal of specialized hardware and its FPGA implementation [4]. While that solution proves fast and efficient in terms of power and acceleration, our focus herein is on widely available GPUs.

Herein we describe a methodology and results for optimizing efficiency while aiming to get the most out of a given GPU unit. In particular, we discuss a "high productivity" solution, namely an implementation in Python 2.7 using the NUMBA package from the Anaconda 4.0 distribution¹.

Section II presents the method proposed herein to map our problem (the computation of the outputs of a neural layer with a given structure and parameters) onto the GPU. Our choice was to process batches of N input samples in order to avoid a map-reduce operation after the kernel computations; such map-reduce operations (synaptic combiners) are instead included in the kernels running as parallel threads on the GPU. Section III investigates how to choose the parameters of both the GPU and the neural structure in order to maximize efficiency. It is shown that a basic neural layer (BNL) can be designed such that speedups (in comparison to the CPU) and GPU efficiency are maximized.
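The allocation scheme described above can be sketched in plain NumPy. The paper's actual NUMBA/CUDA kernels are not reproduced here; in the following illustrative sketch, `thread_body` stands for the work of one hypothetical GPU thread, which computes a single output element for sample i and neuron j and performs the synaptic accumulation (the "map-reduce") inside the thread itself, so that no separate reduction pass over threads is needed. The RBF-style activation and all names (`X`, `W`, `thread_body`, `layer_batch`) are assumptions for illustration, not the paper's code.

```python
import numpy as np

def thread_body(X, W, i, j):
    # Body of one hypothetical GPU thread: output of neuron j for
    # sample i. The synaptic combiner (the sum over the n inputs)
    # runs inside the thread, avoiding a later map-reduce step.
    n = X.shape[1]
    acc = 0.0
    for k in range(n):
        d = X[i, k] - W[j, k]
        acc += d * d
    return np.exp(-acc)  # RBF-style activation (illustrative choice)

def layer_batch(X, W):
    # CPU reference loop over the whole N x m output grid that the
    # GPU would cover with N*m parallel threads, one per element.
    N, m = X.shape[0], W.shape[0]
    Y = np.empty((N, m))
    for i in range(N):
        for j in range(m):
            Y[i, j] = thread_body(X, W, i, j)
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # N=4 samples, n=8 features
W = rng.standard_normal((3, 8))   # m=3 neurons
Y = layer_batch(X, W)

# Vectorized equivalent (roughly what NUMPY, backed by MKL on the
# CPU, would execute for the whole batch at once):
D = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
assert np.allclose(Y, np.exp(-D))
```

Processing a full batch of N samples per kernel launch, as above, is what lets each thread own one complete output element end to end.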
Consequently, if the desired neural network layer is larger than the optimized BNL, it can be split into a convenient number of such optimized units in order to obtain efficient processing. It is shown that, given a GPU unit, there is an optimal set of parameters (the size M of the BNL, where M² corresponds to the number of threads running the basic kernel, as well as the block-size parameter B) that yields the best performance and speedup. Section IV investigates the influence of n (the number of inputs, i.e., the feature-vector size), showing that in order to achieve good efficiency on GPU implementations a large value of n, e.g. n > 100, is desirable. This corresponds well to the situation in image processing, as it often appears in deep-learning neural structures. Section V investigates other possibilities for obtaining efficient implementations, using the Intel MKL libraries available in recent NUMPY distributions. The paper closes with a concluding-remarks section.

II. MAPPING NEURAL LAYERS INTO THE GPU – A PROGRAMMER MODEL

A. Neural network layers and parameters

Figure 1 represents both the neural model and its mapping onto the GPU. The CPU mapping is simpler, since it computes the output of each neuron (corresponding to one thread on the GPU)

¹ https://www.continuum.io/blog/developer-blog/anaconda-4-release

978-1-5386-2059-5/17/$31.00