GPU Cluster for High Performance Computing

Zhe Fan, Feng Qiu, Arie Kaufman, Suzanne Yoakum-Stover
{fzhe, qfeng, ari, suzi}@cs.sunysb.edu
Center For Visual Computing and Department of Computer Science
Stony Brook University, Stony Brook, NY 11794-4400

ABSTRACT

Inspired by the attractive Flops/dollar ratio and the incredible growth in the speed of modern graphics processing units (GPUs), we propose to use a cluster of GPUs for high performance scientific computing. As an example application, we have developed a parallel flow simulation using the lattice Boltzmann model (LBM) on a GPU cluster and have simulated the dispersion of airborne contaminants in the Times Square area of New York City. Using 30 GPU nodes, our simulation computes a 480x400x80 LBM in 0.31 second/step, a speed 4.6 times faster than that of our CPU cluster implementation. Besides the LBM, we also discuss other potential applications of the GPU cluster, such as cellular automata, PDE solvers, and FEM.

Keywords: GPU cluster, data intensive computing, lattice Boltzmann model, urban airborne dispersion, computational fluid dynamics

1 INTRODUCTION

The GPU, which refers to the commodity off-the-shelf 3D graphics card, is specifically designed to be extremely fast at processing large graphics data sets (e.g., polygons and pixels) for rendering tasks. Recently, the use of the GPU to accelerate non-graphics computation has drawn much attention [6, 16, 3, 29, 10, 28]. This research is propelled by two essential considerations:

Price/Performance Ratio: The computational power of today's commodity GPUs has exceeded that of PC-based CPUs. For example, the recently released nVIDIA GeForce 6800 Ultra has been observed to reach 40 GFlops in fragment processing [11]. In comparison, the theoretical peak performance of the Intel 3GHz Pentium 4 using SSE instructions is only 6 GFlops.
SC'04, November 6-12, 2004, Pittsburgh PA, USA
0-7695-2153-3/04 $20.00 (c)2004 IEEE

This high GPU performance results from the following: (1) a current GPU has up to 16 pixel processors and 6 vertex processors that execute 4-dimensional vector floating point instructions in parallel; (2) pipeline constraints are enforced to ensure that data elements stream through the processors without stalls [29]; and (3) unlike the CPU, which has long been recognized to have a memory bottleneck for massive computation [2], the GPU uses fast on-board texture memory, which has one order of magnitude higher bandwidth (e.g., 35.2 GB/sec on the GeForce 6800 Ultra). At the same time, the booming market for computer games drives high volume sales of graphics cards, which keeps prices low compared to other specialty hardware. As a result, the GPU has become a commodity SIMD machine on the desktop that is ready to be exploited for computation exhibiting high parallelism and requiring high memory bandwidth.

Evolution Speed: Driven by the game industry, GPU performance has approximately doubled every 6 months since the mid-1990s [15], which is much faster than the growth rate of CPU performance, which doubles every 18 months on average (Moore's law), and this trend is expected to continue. This is made possible by the explicit parallelism exposed in the graphics hardware. As semiconductor fabrication technology advances, GPUs can use additional transistors much more efficiently for computation than CPUs by increasing the number of pipelines.

Recently, the development of GPUs has reached a new high point with the addition of single-precision 32-bit floating point capabilities and a high-level language programming interface called Cg [20]. The developments mentioned above have facilitated the abstraction of the modern GPU as a stream processor.
Consequently, mapping scientific computation onto the GPU has turned from initial hardware-hacking techniques into more of a high-level design task. Many kinds of computations can be accelerated on GPUs, including sparse linear system solvers, physical simulation, linear algebra operations, partial differential equations, the fast Fourier transform, level-set computation, computational geometry problems, and also non-traditional graphics, such as volume rendering, ray-tracing, and flow visualization. (We