Eurographics Symposium on Parallel Graphics and Visualization (2009)
J. Comba, K. Debattista, and D. Weiskopf (Editors)

Fast Parallel Unbiased Diffeomorphic Atlas Construction on Multi-Graphics Processing Units

Linh K. Ha, Jens Krüger, P. Thomas Fletcher, Sarang Joshi and Cláudio T. Silva, Member, IEEE

Scientific Computing and Imaging Institute at the University of Utah

Abstract

Unbiased diffeomorphic atlas construction has proven to be a powerful technique for medical image analysis, particularly in brain imaging. The method operates on a large set of images, mapping them all into a common coordinate system and creating an unbiased common template for studying intra-population variability and inter-population differences. The technique has also proven effective in tissue and object segmentation via registration of anatomical labels. However, a major barrier to the use of this approach is its high computational cost. As the number of inputs and the data size increase, it becomes impractical even with a fully optimized CPU implementation. Fortunately, the high degree of element-wise independence in the problem makes it well suited for parallel processing. This paper presents an efficient implementation of unbiased diffeomorphic atlas construction on the new parallel processing architecture based on Multi-Graphics Processing Units (Multi-GPUs). Our results show that the GPU implementation gives a substantial performance gain, running on the order of twenty to sixty times faster than a single CPU, and provides an inexpensive alternative to large distributed-memory CPU clusters.

Categories and Subject Descriptors (according to ACM CCS): GPGPU applications, Parallel programming

1. Introduction

Construction of atlases is a key procedure in population-based medical image analysis. In the paradigm of computational anatomy, the atlas serves as a deformable template [Gre94], which is mapped to each individual anatomy.
The deformable template provides a common coordinate system for individual or group analysis of detailed imaging data, including structural, biochemical, functional, or vascular information. The transformations mapping each individual anatomy to the atlas encode the anatomical variability of the population under study. Recently, this concept has also been extended to study anatomical change as a function of age in a population by generalizing non-parametric regression [DFBJ07]. A major barrier to the use of such methods is the high cost associated with atlas construction.

Efficient and scalable solutions for atlas construction are becoming critical to the analysis of large brain imaging studies due to the ever-expanding size of the input data. Advances in magnetic resonance imaging (MRI) are resulting in increasingly higher-resolution images. Furthermore, the trend in neuroimaging studies is towards multi-site collection of large numbers of images, including longitudinal data. For instance, the Alzheimer’s Disease Neuroimaging Initiative currently includes over 900 subjects, most imaged at multiple time-points. Consequently, fast deformable atlas construction has become a subject of considerable interest [CMVG96, BNG96]. However, current CPU-based solutions depend on expensive parallel systems, either shared-memory symmetric multiprocessor machines or distributed-memory clusters. Furthermore, there is only a modest amount of parallelism within a single processing core, and communication between processing units is expensive. Thus, parallel CPU implementations are still time-consuming and do not scale well.

In this paper we present a multi-GPU atlas construction framework based on the unbiased diffeomorphic atlas formulation [JDJG04]. Our method achieves both high quality and very fast processing by exploiting the parallel hardware architecture of multi-GPUs.
Our framework includes an optimized 3D image processing library, hardware-supported integration of nonlinear ordinary differential equations (ODEs), and a multiscale successive over-relaxation (SOR) solver for Helmholtz-like partial differential equations (PDEs). Our system also exploits the coherency of the vector fields, the massive parallelization of GPU hardware, the scalability of multi-GPU architectures, and the efficiency and robustness of multiscale techniques. The system builds atlases of quality comparable to those constructed by CPU algorithms [JDJG04, LDJ05]. Our system is 20 to 60 times faster than a well-optimized single-core CPU algorithm, and still an order of magnitude faster than optimized multi-core CPU algorithms, while exhibiting linear scalability.

In designing an efficient parallel atlas construction, we overcame three challenging issues:

© The Eurographics Association 2009.