Hyperfast Parallel–Beam Backprojection

Marc Kachelrieß, Member, IEEE, Michael Knaup, Olivier Bockenbach

Abstract— Tomographic image reconstruction, such as the reconstruction of CT projection values, of tomosynthesis data, or of PET or SPECT events, is computationally very demanding. The most time–consuming step is the backprojection, which is often limited by the memory bandwidth. Recently, a novel general purpose architecture optimized for distributed computing became available: the Cell Broadband Engine (CBE). Its eight synergistic processing elements (SPEs) currently allow for a theoretical performance of 192 GFlops (3 GHz × 8 units × 4 floats per vector × 2 instructions, multiply and add, per clock). To maximize image reconstruction speed we modified our parallel–beam backprojection algorithm, which is highly optimized for standard PCs, and tuned the code for the Cell processor. Data mining techniques and double buffering of source data were used extensively to optimally utilize both the memory bandwidth and the available local store of each SPE. The pixel–driven backprojection code uses floating point arithmetic and either linear interpolation (LI) or nearest neighbor (NN) interpolation between neighboring detector channels. Performance was measured using simulated data with 512 parallel–beam projections per half rotation and 1024 detector elements. The data were backprojected into an image of 512 × 512 pixels using our PC–based approach and the new Cell–based algorithm. Both the PC and the CBE were clocked at 3 GHz. The images obtained were identical for both approaches. A throughput of 11 fps (LI) and 15 fps (NN) was measured on the PC, whereas the CBE achieved 126 fps (LI) and 165 fps (NN). The Cell thereby greatly outperforms today's top–notch backprojections based on graphics processing units (GPUs).
Using both CBEs of our dual Cell–based blade (Mercury Computer Systems) one can backproject 252 images per second with LI and 330 images per second with NN.

I. INTRODUCTION

Cell processors are general purpose processors that combine a PowerPC element (PPE) with eight synergistic processor elements (SPEs) [1], [2], [3]. The SPEs are the most interesting feature of the Cell processor, as they are the source of its processing power. A single chip contains eight SPEs, each with a synergistic processing unit (SPU), a memory flow controller (MFC), and 256 kB of SRAM used as local store (LS) memory. The LS runs in its own address space at the full 3.2 GHz clock frequency. An SPU uses 128 bit vector operations and can execute up to eight floating point instructions per clock cycle. For our particular focus on backprojecting floating point values (4 bytes each), the data vector consists of four floats. A fast (96 bytes per clock) element interconnect bus (EIB) connects the Cell processor's PPE with the SPEs (figure 1).

Prof. Dr. Marc Kachelrieß: Institute of Medical Physics (IMP), University of Erlangen–Nürnberg, Henkestraße 91, 91052 Erlangen. E–mail: marc.kachelriess@imp.uni-erlangen.de.
Dr. Michael Knaup: Institute of Medical Physics (IMP), University of Erlangen–Nürnberg, Henkestraße 91, 91052 Erlangen. E–mail: michael.knaup@imp.uni-erlangen.de.
Olivier Bockenbach: Mercury Computer Systems, Lepsiusstr. 70, 12163 Berlin. E–mail: olivier@mc.com

Fig. 1. Block diagram of the Cell with pictures of one CBE and of the Mercury dual Cell–based blade

Up to two instructions per cycle can be issued by each SPU to its seven execution units, which are organized in two pipelines. To overcome memory latency (the "memory wall"), DMA data transfers from and to the SPU can be scheduled in parallel with core execution. The PPE can be understood as the controller or manager that distributes small tasks to the eight SPEs, which are the workers.
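The double–buffering idea mentioned above can be illustrated schematically: while the worker computes on one buffer, the next chunk of source data is transferred into a second buffer, hiding memory latency. The following sketch uses a plain C++ asynchronous copy in place of real MFC DMA intrinsics; the function name, buffer layout, and the summation workload are illustrative assumptions, not the authors' code.

```cpp
#include <cstring>
#include <future>
#include <vector>

// Double buffering (illustrative sketch, not Cell-specific code):
// while the worker processes one buffer, the next chunk is fetched
// into the other buffer. std::async stands in for an asynchronous
// MFC DMA transfer into an SPE's local store.
float process_all(const std::vector<float>& mainMemory, size_t chunk)
{
    std::vector<float> buf[2] = {std::vector<float>(chunk),
                                 std::vector<float>(chunk)};
    const size_t nChunks = mainMemory.size() / chunk;
    float sum = 0.0f;

    auto fetch = [&](int b, size_t i) {          // "DMA get" into local buffer b
        std::memcpy(buf[b].data(), mainMemory.data() + i * chunk,
                    chunk * sizeof(float));
    };

    fetch(0, 0);                                 // prime the first buffer
    for (size_t i = 0; i < nChunks; ++i) {
        std::future<void> pending;
        if (i + 1 < nChunks)                     // start the next transfer early
            pending = std::async(std::launch::async, fetch,
                                 int((i + 1) % 2), i + 1);
        for (float v : buf[i % 2]) sum += v;     // compute on the current buffer
        if (pending.valid()) pending.wait();     // transfer done before swapping
    }
    return sum;
}
```

On the Cell, the copy would be issued with non-blocking DMA commands and completion would be checked via tag status rather than a future, but the overlap of transfer and computation is the same.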
In our case, communication between the manager and the workers is realized via mailboxes and DMA transfers.

The aim of this investigation is to implement a 2D parallel–beam backprojection algorithm for the Cell processor and to benchmark its performance against PC–based implementations.

II. METHOD

We consider a backprojection of the type

f(x, y) = ∫ dϑ p(ϑ, a(ϑ)x + b(ϑ)y + c(ϑ))

with f being the image, p being the (convolved) raw data, and where a, b, and c are arbitrary functions of ϑ. For example, a scanner with projection angle ϑ and ray distance ξ to the origin would have a = cos ϑ, b = sin ϑ and c = 0, such that the ray parameterized by the pair (ϑ, ξ) is the line x cos ϑ + y sin ϑ = ξ and the look–up would occur at p(ϑ, ξ).

A. Implementation

The backprojection integral is usually realized in a discretized version called pixel–driven backprojection. Our reference code is shown in the listings. Listing 1 shows the nearest neighbor interpolation, listing 2 performs linear interpolation. Apart from this unoptimized reference code our highly optimized PC–based implementation (pure C++, coded in
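The listings themselves are not reproduced in this excerpt. A minimal sketch of pixel–driven backprojection with the parametrization a = cos ϑ, b = sin ϑ, c = 0, supporting both NN and LI interpolation, might look as follows; the array layout, centering conventions, and function signature are assumptions for illustration, not the authors' reference code.

```cpp
#include <cmath>
#include <vector>

// Pixel-driven parallel-beam backprojection (illustrative sketch).
// rawdata: nProj x nDet convolved projections over a half rotation,
// image:   nPix x nPix accumulator. Unit detector and pixel spacing
// and centered coordinates are assumed.
void backproject(const std::vector<float>& rawdata, int nProj, int nDet,
                 std::vector<float>& image, int nPix, bool linearInterp)
{
    const float kPi    = 3.14159265358979f;
    const float dTheta = kPi / nProj;          // half rotation
    const float cDet   = 0.5f * (nDet - 1);    // center detector channel
    const float cPix   = 0.5f * (nPix - 1);    // center image pixel
    for (int t = 0; t < nProj; ++t) {
        const float theta = t * dTheta;
        const float a = std::cos(theta);       // a(theta)
        const float b = std::sin(theta);       // b(theta); c(theta) = 0
        const float* proj = &rawdata[size_t(t) * nDet];
        for (int iy = 0; iy < nPix; ++iy) {
            const float y = iy - cPix;
            for (int ix = 0; ix < nPix; ++ix) {
                const float x  = ix - cPix;
                const float xi = a * x + b * y + cDet;  // detector coordinate
                if (linearInterp) {
                    const int k = int(std::floor(xi));  // LI between channels
                    if (k >= 0 && k + 1 < nDet) {
                        const float w = xi - k;
                        image[size_t(iy) * nPix + ix] +=
                            (1.0f - w) * proj[k] + w * proj[k + 1];
                    }
                } else {
                    const int k = int(xi + 0.5f);       // nearest neighbor
                    if (k >= 0 && k < nDet)
                        image[size_t(iy) * nPix + ix] += proj[k];
                }
            }
        }
    }
}
```

For each image pixel, the inner loop evaluates a(ϑ)x + b(ϑ)y + c(ϑ) to find the detector coordinate ξ and accumulates the interpolated projection value, which is exactly the discretized form of the backprojection integral above.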