Data-Parallel Algorithms for Large-Scale Real-Time Simulation of the Cellular Potts Model on Graphics Processing Units

Jose Juan Tapia and Roshan D'Souza
Department of Mechanical Engineering-Engineering Mechanics
Michigan Technological University, Houghton, MI, USA
{jjtapiav,rmdsouza}@mtu.edu

Abstract—In this paper we present techniques for data-parallel execution of the Cellular Potts Model (CPM) on Graphics Processing Units (GPUs). We have developed data structures and algorithms that are optimized to use the available hardware resources on the GPU. To the best of our knowledge, this is the first attempt at using data-parallel techniques to simulate the CPM. We benchmarked this implementation against other parallel CPM implementations that use traditional CPU clusters. Experimental results demonstrate that our implementation avoids many of the drawbacks of traditional CPU clusters and achieves a performance gain of up to 30x without sacrificing the integrity of the original model.

Index Terms—Cellular Potts Model, GPGPU, Cellular Arrays and Automata, Biophysics

I. INTRODUCTION

Computational Biology has emerged as an area that serves as an investigatory compass for biologists [1]. Although it will not replace in-vivo and in-vitro experimentation, it enables reduction of the search space through virtual experimentation using in-silico models [2]. Typical techniques for simulating biological models include analytical techniques (systems of differential equations) [3] and Monte-Carlo style techniques such as the Gillespie algorithm [4], agent-based modeling [5], and the Cellular Potts Model [6]. Monte-Carlo style techniques are inherently capable of capturing the heterogeneity and stochasticity exhibited in many biological systems. However, they rely on multiple simulation runs to generate dense data sets for statistical analysis.
Moreover, simulating high-fidelity models with these techniques is often beyond the processing capability of a single Central Processing Unit (CPU). The obvious solution for scaling beyond a single CPU is to divide the computation among many CPUs using parallel computing techniques. However, this solution brings its own set of problems. First, because of the large difference in bandwidth between Random Access Memory (RAM) access and inter-CPU communication, problems that are not embarrassingly parallel more often than not scale poorly; in certain situations, adding CPUs actually reduces performance because of communication overheads. Second, there is the cost of acquiring and maintaining a cluster, including the costs of assembling, installing, and powering the individual processing nodes as well as the communication infrastructure of high-speed networks and routers.

In addition, CPUs are optimized for von Neumann style computation and devote much of the real estate on the integrated chip to control logic. Data-parallel architectures such as Graphics Processing Units (GPUs), on the other hand, are optimized for high throughput, with a simplified memory architecture and most resources devoted to computation. They are becoming an increasingly powerful and economical alternative to multi-CPU parallel computing systems, particularly for scientific computing. GPUs initially had fixed functionality, but the demand for customizable computer-graphics routines led GPU vendors to introduce programmability. Computational scientists have used this programmability to develop fast algorithms for scientific computation, a practice generally known as General-Purpose computing on Graphics Processing Units (GPGPU) [7].
While GPUs have much higher throughput, this performance advantage comes with restrictions on the types of computations that can be performed on the individual computing cores of a GPU. Although the cost of launching a single execution thread is fairly small, the memory resources associated with each thread are limited. GPU threads therefore work most efficiently when the executed code is non-blocking with minimal branching. Consequently, algorithms and code developed for CPU execution cannot be directly ported to a GPU; GPU execution requires entirely new sets of algorithms that are optimized for the architecture.

In this paper we describe algorithms for executing the Cellular Potts Model on data-parallel architectures such as the GPU. Data structures have been developed to efficiently handle the computation of the local and non-local effective energy terms. We have optimized memory bandwidth through proper use of the different memory types, such as texture, global, and shared memory. Benchmarks show a substantial performance gain when compared against results obtained from parallelizing the CPM on traditional CPU clusters. This implementation uses the Compute Unified Device Architecture (CUDA), an API developed by NVIDIA specifically for non-graphics applications.

978-1-4244-2794-9/09/$25.00 © 2009 IEEE — SMC 2009
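For readers unfamiliar with the CPM, the effective energy mentioned above typically combines a local adhesion term over neighboring lattice sites with a non-local constraint such as a cell-volume penalty, and lattice updates are accepted with a Metropolis rule. The following is a minimal serial sketch of these terms in plain Python, not the paper's GPU implementation; the function names, the uniform contact energy `J`, and the single volume-constraint parameters `target` and `lam` are simplifying assumptions for illustration.

```python
import math
import random

def adhesion_energy(lattice, J):
    """Local term: sum the contact energy J over every pair of
    von Neumann neighbors that belong to different cells."""
    n = len(lattice)
    h = 0.0
    for i in range(n):
        for j in range(n):
            # Count each horizontal/vertical neighbor pair exactly once.
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = i + di, j + dj
                if ni < n and nj < n and lattice[i][j] != lattice[ni][nj]:
                    h += J
    return h

def volume_energy(lattice, target, lam):
    """Non-local term: quadratic penalty on each cell's deviation from
    its target volume (cell id 0 is the medium and is unconstrained)."""
    counts = {}
    for row in lattice:
        for cell_id in row:
            counts[cell_id] = counts.get(cell_id, 0) + 1
    return sum(lam * (v - target) ** 2
               for cell_id, v in counts.items() if cell_id != 0)

def metropolis_accept(delta_h, temperature, rng=random.random):
    """Boltzmann acceptance rule for a proposed site-copy attempt:
    always accept if the energy does not increase, otherwise accept
    with probability exp(-delta_h / T)."""
    if delta_h <= 0:
        return True
    return rng() < math.exp(-delta_h / temperature)
```

In a full Monte-Carlo step, a random site copies a random neighbor's cell id, the change in total effective energy is computed from these terms, and `metropolis_accept` decides whether the copy is kept; the data-parallel algorithms described in this paper must evaluate many such attempts concurrently without corrupting shared neighborhoods.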