The Performance of Gridding/Degridding on the Cell/B.E.

A.L. Varbanescu, A. van Amesfoort, H.J. Sips, T. Cornwell, B. Elmegreen, A. Mattingly, G. van Diepen, R. van Nieuwpoort

Abstract

In this report we present our experience with porting two very time-consuming kernels from radioastronomy imaging, namely the gridding and degridding operations. Both procedures are implemented using convolutional resampling, a bandwidth-limited application that is not trivial to parallelize on a hybrid memory architecture. We briefly discuss the role of the application in the radioastronomy context, and we present the Cell/B.E. multi-core processor as an interesting target for this application. Further, we show that the original reference implementation (sequential C++ code) was not “Cell-friendly”, which is also why our first parallel version had poor results. Next, we show how the reference algorithm was tuned to become more suitable for Cell parallelization. We discuss how we have improved our parallelization strategy, and we present our performance results with the new tuned algorithm running on the Cell/B.E. processor, showing a speed-up factor of over 20 when compared with the reference implementation. We conclude that a single Cell/B.E. can provide good results for a small-scale version of the radioastronomy imaging problem, and we sketch our future work directions towards using a parallel machine with multiple Cell/B.E. processors to address the real scale of the gridding/degridding applications.

1 Introduction

Due to large radioastronomy projects like ASKAP or LOFAR, followed by even larger multi-national projects like SKA, radioastronomy has become one of the most active high performance computing fields. Significantly increased data volumes, combined with streaming requirements and complex algorithms, represent interesting challenges for supercomputers and new opportunities for multi-core processors.
In the work reported here, we focus on a specific radioastronomy imaging application and its implementation on the Cell/B.E., a heterogeneous multi-core processor jointly built by IBM, Sony and Toshiba. Despite the complex programming effort, described in the following sections, the performance we have obtained (a speed-up factor of over 20 for one Cell/B.E. when compared with a commodity machine) is encouraging. Furthermore, the predicted application scalability towards larger problem sizes is an additional incentive for using the Cell/B.E. as the next generation HPC hardware for radioastronomy.

The rest of this report is structured as follows. Section 2 briefly discusses our target application and its role in the radioastronomy field, while Section 3 presents the Cell/B.E. processor, emphasizing the features that make it both very performant and difficult to program. Next, in Section 4 we discuss the reference application in detail, focusing on showing its main challenges for an efficient Cell/B.E. parallelization. We present our first parallelization attempt and its results on the Cell/B.E. Based on our empirical analysis, backed by the results of this first parallel version,