An Improved CUDA-Based Implementation of Differential Evolution on GPU

A. K. Qin
INRIA Grenoble Rhone-Alpes
655 avenue de l'Europe, Montbonnot
38334 Saint Ismier Cedex, France
kai.qin@inria.fr

Federico Raimondo
INRIA Grenoble Rhone-Alpes
655 avenue de l'Europe, Montbonnot
38334 Saint Ismier Cedex, France
federaimondo@gmail.com

Yew Soon Ong
School of Computer Engineering
Nanyang Technological University
Nanyang Avenue, 639798, Singapore
asysong@ntu.edu.sg

Florence Forbes
INRIA Grenoble Rhone-Alpes
655 avenue de l'Europe, Montbonnot
38334 Saint Ismier Cedex, France
florence.forbes@inria.fr

ABSTRACT
Modern GPUs enable widely affordable personal computers to carry out massively parallel computation tasks. NVIDIA's CUDA technology provides a convenient parallel computing platform, and many state-of-the-art algorithms from diverse fields have been redesigned on top of it to achieve computational speedups. Differential evolution (DE), a highly promising evolutionary algorithm, is well suited to parallelization owing to its data-parallel algorithmic structure. However, most existing CUDA-based DE implementations suffer from excessive low-throughput global memory access and inefficient device utilization. This work presents an improved CUDA-based DE that optimizes memory access and device utilization: several logically related kernels are combined into one composite kernel to reduce global memory access; kernel execution configuration parameters are determined automatically to maximize device occupancy; and streams are employed to enable concurrent kernel execution, maximizing device utilization. Experimental results on several numerical problems demonstrate the superior computational time efficiency of the proposed method over two recent CUDA-based DE implementations and over sequential DE, across varying problem dimensions and population sizes.
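As a rough illustration of the kernel-combination idea summarized above (not the authors' actual code), a single composite kernel can fuse DE's mutation, crossover, and selection steps so that each trial vector lives in registers/local memory rather than round-tripping through global memory between separate kernel launches. All identifiers, the DE/rand/1/bin strategy, the Sphere fitness function, and the use of cuRAND are illustrative assumptions.

```cuda
#include <curand_kernel.h>

#define MAX_DIM 128  // assumed upper bound on problem dimensionality

// Illustrative device fitness function (Sphere: sum of squares).
__device__ float sphere(const float *x, int dim)
{
    float s = 0.0f;
    for (int j = 0; j < dim; ++j) s += x[j] * x[j];
    return s;
}

// Composite kernel: one thread evolves one individual, fusing
// DE/rand/1/bin mutation, binomial crossover, and greedy selection
// into a single launch so the trial vector never touches global memory.
__global__ void de_generation(float *pop, float *fit, int np, int dim,
                              float F, float CR, curandState *states)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= np) return;
    curandState rng = states[i];

    // Pick three mutually distinct indices r1, r2, r3, all != i.
    int r1, r2, r3;
    do { r1 = curand(&rng) % np; } while (r1 == i);
    do { r2 = curand(&rng) % np; } while (r2 == i || r2 == r1);
    do { r3 = curand(&rng) % np; } while (r3 == i || r3 == r1 || r3 == r2);

    float trial[MAX_DIM];            // stays in registers/local memory
    int jrand = curand(&rng) % dim;  // guaranteed crossover position
    for (int j = 0; j < dim; ++j) {
        float v = pop[r1 * dim + j]
                + F * (pop[r2 * dim + j] - pop[r3 * dim + j]);
        trial[j] = (curand_uniform(&rng) < CR || j == jrand)
                 ? v : pop[i * dim + j];
    }

    float f = sphere(trial, dim);    // fused fitness evaluation
    if (f <= fit[i]) {               // greedy selection
        for (int j = 0; j < dim; ++j) pop[i * dim + j] = trial[j];
        fit[i] = f;
    }
    states[i] = rng;                 // persist the per-thread RNG state
}
```

With this fusion, one generation costs a single kernel launch and a single read/write pass over the population in global memory, instead of one pass per DE operator.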
Categories and Subject Descriptors
I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search

Keywords
CUDA, Compute Unified Device Architecture, DE, Differential Evolution, GPU, Graphics Processing Unit, Massively Parallel Computing

1. INTRODUCTION
Over the past decades, evolutionary algorithms (EAs) [1] have shown remarkable efficacy in solving diverse real-world optimization problems. However, they may require considerable computation time when handling large-scale and complex tasks, which puts EAs out of reach for applications with tight computational budgets.

EAs maintain a population of candidate solutions that explores a given solution space using various nature-inspired operations, such as selection, reproduction, and replacement, gradually evolving the population in the quest for global optima. This class of algorithms is inherently parallelizable, since population members are typically subjected to the same operations. Nevertheless, the majority of existing EAs have been designed and implemented sequentially, because hardware and software platforms that facilitate parallel computing were historically neither widely available nor affordable.

In recent years, the graphics processing unit (GPU) has emerged as a powerful computing device that supports general-purpose, massively data-parallel computation by means of its hundreds of streaming processors (SPs). With affordable prices and accessible parallel computing platforms, modern GPUs now give numerous personal computers (PCs) the capability of running massively parallel applications. Among the existing parallel computing platforms on GPUs, NVIDIA's compute unified device architecture (CUDA) [2, 3] provides an intuitive and scalable programming model based on an extension of the C programming language: CUDA-C.
Developers simply write a C-style routine that processes one data element; the routine is then automatically distributed across hundreds of SPs, with thousands of threads processing different data elements. Because developers already familiar with the C language can grasp CUDA-C with little effort, many state-of-the-art algorithms from different scientific and engineering fields have been redesigned on CUDA to speed up their computation. However, the computational time efficiency of CUDA-C applications depends on careful consideration of various technical properties of GPUs during design and implementation. Without such consideration, parallel programs written in CUDA-C may even run slower than their sequential counterparts.

Differential evolution (DE) [4], one of the most promising state-of-the-art EAs, has consistently demonstrated superiority for

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GECCO'12, July 7–11, 2012, Philadelphia, USA. Copyright 2012 ACM 978-1-4503-1177-9/12/07...$10.00.
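To make the thread-per-data-element model described above concrete, here is a minimal hedged sketch (all names are illustrative, not from the paper): each thread evaluates the fitness of one individual, and the host launches just enough blocks to cover the whole population.

```cuda
// One C-style routine per data element: thread i evaluates individual i.
// pop is an np x dim population laid out row-major in global memory.
__global__ void eval_fitness(const float *pop, float *fit, int np, int dim)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= np) return;             // guard: grid may be over-provisioned
    float s = 0.0f;
    for (int j = 0; j < dim; ++j) {  // e.g. the Sphere function
        float x = pop[i * dim + j];
        s += x * x;
    }
    fit[i] = s;
}

// Host side: choose a block size and launch enough blocks for np threads.
void launch_eval(const float *d_pop, float *d_fit, int np, int dim)
{
    int block = 256;                       // a common default block size
    int grid  = (np + block - 1) / block;  // ceil(np / block)
    eval_fitness<<<grid, block>>>(d_pop, d_fit, np, dim);
}
```

The fixed block size of 256 here is exactly the kind of hand-picked execution configuration parameter that the paper proposes to determine automatically, based on device occupancy, instead.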