Optimized GPU Implementation of Learning-Based Non-Rigid Multi-Modal Registration

Zhe Fan a, Christoph Vetter b, Christoph Guetter b, Daphne Yu b, Rüdiger Westermann c, Arie Kaufman a, Chenyang Xu b*

a Computer Science Department, Stony Brook University, Stony Brook, NY 11794-4400, USA
b Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540-6632, USA
c Computer Science Department, Technische Universität München, Garching 85748, Germany

ABSTRACT

Non-rigid multi-modal volume registration is computationally intensive due to its high-dimensional parameter space; typical CPU computation times are several minutes. Medical imaging applications that use registration, however, demand ever faster implementations for several purposes: matching the data acquisition speed, providing smooth user interaction and steering for quality control, and performing population registration involving multiple datasets. Current GPUs offer an opportunity to boost the registration speed through high computational power at low cost. In our previous work, we presented a GPU implementation of non-rigid multi-modal volume registration that was 6 to 8 times faster than a software implementation. In this paper, we extend this work by describing how new features of DX10-compatible GPUs and additional optimization strategies can be employed to further improve the algorithm's performance. We compared our optimized version with the previous version on the same GPU and observed a speedup factor of 3.6. Compared with the software implementation, we achieve a speedup factor of up to 44.

Keywords: Non-rigid registration, Multi-modal volumes, Learning, Mutual information, Kullback-Leibler divergence, GPU

1. INTRODUCTION

Non-rigid multi-modal registration is becoming one of the fundamental tasks in medical imaging.
Its goal is to align two 2D/3D medical images of different modalities (e.g., CT, PET, SPECT, MRI), taking into consideration the unknown non-rigid deformation over time. Because the non-rigid deformation has a high-dimensional parameter space, the registration is computationally intensive. Fortunately, registration algorithms usually exhibit pixel-level parallelism in most parts of their computations. The problem is therefore well suited to hardware acceleration using commodity graphics processing units (GPUs).

Modern GPUs have been designed to be extremely fast at raster-based rendering, which is a data-parallel computation. In recent years, driven by the graphics market, the performance of GPUs has been growing at a much faster pace than that of CPUs. The raw computational power of GPUs has surpassed that of CPUs by an order of magnitude, while the cost of GPUs has remained low. Moreover, high-level languages such as GLSL [1], Cg [2], and HLSL have been incorporated into the graphics APIs OpenGL and DirectX for GPU programming. Propelled by these factors, general-purpose computation using GPUs (GPGPU) has become an active research topic. Many applications, such as physically-based simulation, data processing, volume rendering, and database operations, have been mapped to GPUs for acceleration. Interested readers are referred to a survey paper [3] and the GPGPU website [4] for further information.

The rapid development of GPGPU has recently motivated major GPU vendors, such as NVIDIA and AMD, to target the high performance computing (HPC) market.
*Further author information: (Send correspondence to Chenyang Xu)
Zhe Fan: fzhe@cs.stonybrook.edu, Christoph Vetter: christoph.vetter@siemens.com, Christoph Guetter: christoph.guetter@siemens.com, Daphne Yu: daphne.yu@siemens.com, Rüdiger Westermann: westermann@in.tum.de, Arie Kaufman: ari@cs.stonybrook.edu, Chenyang Xu: chenyang.xu@siemens.com

They are continuously improving their programming interfaces, increasing the video memory sizes, and adding