Evolutionary Training of Deep Neural Networks on Heterogeneous Computing Environments

Subodh Kalia, Department of EECS, Syracuse, New York, US, skalia@syr.edu
Chilukuri K. Mohan, Department of EECS, Syracuse, New York, US, mohan@syr.edu
Ramakrishna Nemani, BAER Institute, Mountain View, California, US, nemani@baeri.org

ABSTRACT
Deep neural networks are typically trained using gradient-based optimizers such as error backpropagation. This study proposes a framework based on Evolutionary Algorithms (EAs) to train deep neural networks without gradients. The network parameters, which may number in the millions, are treated as optimization variables. We demonstrate the training of an encoder-decoder segmentation network (U-Net) and a Long Short-Term Memory (LSTM) model using (μ+λ)-ES, a Genetic Algorithm, and Particle Swarm Optimization. The framework can train models with forward propagation on machines with different hardware in a cluster computing environment. We compare prediction results from the two models trained using our framework and backpropagation. We show that the neural networks can be trained in less time on CPUs than on specialized compute-intensive GPUs.

CCS CONCEPTS
• Computing methodologies → Genetic algorithms.

KEYWORDS
Neural Networks, Evolutionary Algorithms, Heuristics, Parallelization

ACM Reference Format:
Subodh Kalia, Chilukuri K. Mohan, and Ramakrishna Nemani. 2022. Evolutionary Training of Deep Neural Networks on Heterogeneous Computing Environments. In Proceedings of Genetic and Evolutionary Computation Conference Companion (GECCO ’22 Companion). ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3520304.3533954

1 INTRODUCTION
Deep neural networks have achieved near-human performance in various tasks, including speech recognition, computer vision, natural language processing, image analysis, and forecasting [15].
Corresponding author.
ACM acknowledges that this contribution was authored or co-authored by an employee, contractor, or affiliate of the United States government. As such, the United States government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for government purposes only.
GECCO ’22 Companion, July 9–13, 2022, Boston, MA, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9268-6/22/07. . . $15.00
https://doi.org/10.1145/3520304.3533954

Such deep networks contain multiple layers in their architectures and are trained end-to-end using error backpropagation. Adam [20], Stochastic Gradient Descent [2], and Adagrad [13] are popular gradient-based optimizers included in machine learning frameworks such as TensorFlow, MXNet, and Caffe. These optimizers minimize the loss function by computing its gradient with respect to the network parameters. Machine learning models are typically trained on one or more GPUs, since gradient-based training on CPUs is slow [17]. State-of-the-art deep neural network models can take days to train on the latest GPUs and may not be trainable on traditional CPUs within a reasonable time frame. Further, on a machine with multiple GPUs, it is preferable to use cards from the same vendor (either Nvidia or AMD), since mixing different cards on the same machine can cause performance issues. Hence, selecting appropriate hardware to train a model is a challenge, and training time can vary from days to weeks depending on the selected hardware.

Several nature-inspired algorithms, such as Genetic Algorithms (GA), Evolution Strategies (ES), and Tabu Search, have been used to evolve network architectures and their hyper-parameters [4, 16]. Evolutionary Algorithms (EAs) have also been used in hybrid approaches to train networks and optimize their hyper-parameters [12, 18].
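To make the gradient-free alternative concrete, the following is a minimal sketch (not the authors' implementation) of a (μ+λ) evolution strategy that treats the flattened weights of a tiny one-hidden-layer network as the optimization variables and scores candidates using only forward passes. All sizes, the toy regression task, and the hyper-parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: learn y = sin(x) with a tiny one-hidden-layer net.
X = np.linspace(-3, 3, 64).reshape(-1, 1)
y = np.sin(X)

H = 16                                    # hidden units (illustrative)
n_params = 1 * H + H + H * 1 + 1          # W1, b1, W2, b2 flattened

def forward(w, x):
    """Forward pass only -- no gradients are ever computed."""
    W1 = w[:H].reshape(1, H)
    b1 = w[H:2 * H]
    W2 = w[2 * H:3 * H].reshape(H, 1)
    b2 = w[3 * H]
    return np.tanh(x @ W1 + b1) @ W2 + b2

def fitness(w):
    """Negative mean squared error (higher is better)."""
    return -np.mean((forward(w, X) - y) ** 2)

mu, lam, sigma = 5, 20, 0.1
parents = [rng.normal(0, 0.5, n_params) for _ in range(mu)]

for gen in range(300):
    # Each parent spawns offspring by Gaussian mutation of its weights.
    offspring = [p + sigma * rng.normal(size=n_params)
                 for p in parents for _ in range(lam // mu)]
    # (mu+lambda) selection: the best mu of parents and offspring survive.
    pool = parents + offspring
    pool.sort(key=fitness, reverse=True)
    parents = pool[:mu]

print("best MSE:", -fitness(parents[0]))
```

Because each fitness call is an independent forward pass, the offspring evaluations can be distributed across machines, which is the property the framework described below exploits.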
Direct methods to train neural networks using evolutionary algorithms require a large population size, which increases the training time for deep neural networks [5].

We deviate from the standard gradient-based approach and propose a framework for distributed EAs to train deep neural networks of arbitrary size. We exploit the fact that forward propagation alone (gradient-free) is much faster than combined forward and backward propagation (the standard gradient-based approach). To eliminate hardware-related limitations, we introduce a server-client TCP/IP socket communication setup that performs the forward-propagation task on multiple machines. We use the term 'heterogeneous' for a computing environment consisting of machines with different hardware running different operating systems. Since the TCP/IP protocol facilitates cross-platform communication, we use it to send and receive data between machines connected by the high-speed intranet inside the Pleiades supercomputer [8].

This paper is organized as follows. Section 2 describes our EA framework and the underlying evolutionary algorithms. Section 3 presents a use case in which we train an image segmentation model (U-Net) without gradients using our framework and compare its performance with backpropagation. Section 4 presents a second use case in which we train a Long Short-Term Memory (LSTM) model on a time series dataset using our framework.

2 EA FRAMEWORK
A schematic of the framework is shown in Figure 1. The framework starts with a user-provided deep learning