Algorithm XXX: Semi-stencil

RAÚL DE LA CRUZ, Barcelona Supercomputing Center
MAURICIO ARAYA-POLO, Repsol USA

Finite Difference (FD) is a widely used method for solving Partial Differential Equations (PDEs). PDEs are at the core of many simulations in scientific fields such as geophysics and astrophysics. The typical FD solver performs stencil computations over the entire computational domain, thus solving the differential operators. In general terms, a stencil computation is a weighted accumulation of the contributions of neighboring points along the Cartesian axes. Therefore, optimizing stencil computations is crucial to reducing application execution time. Stencil computation performance is bounded by two main factors: the memory access pattern and the inefficient reuse of the accessed data. We propose a novel algorithm, named Semi-stencil, that tackles these two problems. The main idea behind this algorithm is to change the way in which the stencil computation progresses within the computational domain. Instead of accessing all required neighbors and adding all their contributions at once, the Semi-stencil algorithm divides the computation into several updates. Each update gathers half of the axis neighbors and, at the same time, partially computes the stencil in a set of closely located points. As the Semi-stencil progresses through the domain, the stencil computations are completed on the precomputed points. This computation strategy improves the memory access pattern and reuses the accessed data more efficiently. Our initial target architecture was the Cell/B.E., where the Semi-stencil on an SPE was 44% faster than the naive stencil implementation. Since then, we have continued our research on emerging multi-core architectures in order to assess and extend this work on homogeneous architectures. The experiments presented combine the Semi-stencil strategy with space and time-blocking algorithms used in hierarchical memory architectures.
Two x86 (Intel Nehalem and AMD Opteron) and two POWER (IBM POWER6 and IBM BG/P) platforms are used as testbeds, where the best speedups for a 25-point stencil range from 1.27× to 1.76×. The results show that this novel strategy is a feasible optimization method that may be integrated into auto-tuning frameworks. Also, since all current architectures are multi-core based, we include a brief section presenting scalability results on IBM POWER7, Intel Xeon, and MIC-based systems. In a nutshell, the algorithm scales as well as or better than other stencil techniques. For instance, the scalability of the Semi-stencil on MIC for one test case reached 93.8× on 244 threads.

Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems—Modeling techniques; F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems—Computations in finite fields; G.1.8 [Numerical Analysis]: Partial Differential Equations—Finite Difference Methods

General Terms: Algorithms, Experimentation, Measurement, Performance

Additional Key Words and Phrases: stencil computation, Semi-stencil, Blocking, Time-skewing, Cache-Oblivious, HPC, code optimization, numerical algorithms, performance model

This work was supported by project TIN2007-60625 of the Spanish Government's Science and Innovation Ministry.

Authors' address: Raúl de la Cruz, CASE Department, Barcelona Supercomputing Center, Barcelona, Spain; email: delacruz@bsc.es; Mauricio Araya-Polo, Repsol USA, The Woodlands, TX, USA; email: araya.mauricio@repsol.com.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation.
Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.

© YYYY ACM 0098-3500/YYYY/01-ARTA $15.00
DOI: http://dx.doi.org/10.1145/0000000.0000000

ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: January YYYY.
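
To make the idea summarized in the abstract concrete, the following is a minimal one-dimensional sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `classical_stencil` and `semi_stencil`, the symmetric coefficient layout `c[0..ell]`, and the choice to leave boundary points at zero are all hypothetical. The classical version gathers all `2*ell + 1` neighbors of a point at once. The semi-stencil version reads only the window `u[i..i+ell]` per iteration and uses it twice: a forward update starts the point `i + ell` with its left half, and a backward update completes the point `i` with its center and right half.

```python
def classical_stencil(u, c):
    """Reference 1D stencil: each interior point gathers all 2*ell + 1 neighbors."""
    n, ell = len(u), len(c) - 1
    out = [0.0] * n  # boundary points are left at zero (assumption of this sketch)
    for i in range(ell, n - ell):
        acc = c[0] * u[i]
        for r in range(1, ell + 1):
            acc += c[r] * (u[i - r] + u[i + r])
        out[i] = acc
    return out


def semi_stencil(u, c):
    """Same result, computed as two half updates per iteration over u[i..i+ell]."""
    n, ell = len(u), len(c) - 1
    out = [0.0] * n
    for i in range(0, n - ell):
        target = i + ell
        # Forward update: start the interior point `target` with its left half,
        # i.e. the neighbors u[target-1], ..., u[target-ell] = u[i].
        if target < n - ell:
            acc = 0.0
            for r in range(1, ell + 1):
                acc += c[r] * u[target - r]
            out[target] = acc
        # Backward update: complete the interior point i (started ell iterations
        # earlier) with its center and right half u[i+1], ..., u[i+ell].
        if i >= ell:
            acc = c[0] * u[i]
            for r in range(1, ell + 1):
                acc += c[r] * u[i + r]
            out[i] += acc
    return out
```

In this sketch the first `ell` iterations perform only forward updates and the last `ell` only backward updates; in between, both updates share the same `ell + 1` input values, at the cost of one extra store and load of the partial result. Because the two versions add terms in a different order, their floating-point results can differ by rounding, so any comparison should use a tolerance.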