The Performance of Gridding/Degridding on the Cell/B.E.

A.L. Varbanescu, A. van Amesfoort, H.J. Sips, T. Cornwell, B. Elmegreen, A. Mattingly, G. van Diepen, R. van Nieuwpoort

Abstract

In this report we present our experience with porting two very time-consuming kernels from radioastronomy imaging, i.e., the gridding/degridding operations. Both these procedures are implemented using convolutional resampling, a bandwidth-limited application that is not trivial to parallelize on a hybrid memory architecture. We briefly discuss the role of the application in the radioastronomy context, and we present the Cell/B.E. multi-core processor as an interesting target for this application. Further, we show how the original reference implementation (sequential, C++ code) was not “Cell-friendly”, which is also why our first parallel version had poor results. Next, we show how the reference algorithm was tuned to become more suitable for Cell parallelization. We discuss how we have improved our parallelization strategy and we present our performance results with the new tuned algorithm running on the Cell/B.E. processor, showing a speed-up factor of over 20 when compared with the reference implementation. We conclude that a single Cell/B.E. can provide good results for a small-scale version of the radioastronomy imaging problem, and we sketch our future work directions towards using a parallel machine with multiple Cell/B.E. processors to address the real scale of the gridding/degridding applications.

1 Introduction

Due to large radioastronomy projects like ASKAP or LOFAR, followed by even larger multinational projects like SKA, radioastronomy has become a very active high-performance computing field. Significantly increased data volumes, combined with streaming requirements and complex algorithms, represent interesting challenges for supercomputers and new opportunities for multi-core processors.
In the work reported here, we focus on a specific radioastronomy imaging application and its implementation on the Cell/B.E., a heterogeneous multi-core processor jointly built by IBM, Sony and Toshiba. Despite the complex programming effort, described in the following sections, the performance we have obtained (a speed-up factor of over 20 for one Cell/B.E. when compared with a commodity machine) is encouraging. Furthermore, the predicted application scalability towards larger problem sizes is an additional incentive for using the Cell/B.E. as the next generation HPC hardware for radioastronomy.

The rest of this report is structured as follows. Section 2 briefly discusses our target application and its role in the radioastronomy field, while Section 3 presents the Cell/B.E. processor, emphasizing the features that make it both high-performing and difficult to program. Next, in Section 4 we discuss the reference application in detail, focusing on its main challenges for an efficient Cell/B.E. parallelization. We present our first parallelization attempt and its results on the Cell/B.E. Based on our empirical analysis, backed by the results of this first parallel version,