© British Computer Society 2001

A Data-Parallel Formulation for Divide and Conquer Algorithms

M. AMOR¹, F. ARGÜELLO², J. LÓPEZ³, O. PLATA³ AND E. L. ZAPATA³

¹Department of Electronics and Systems, University of La Coruña, E-15071 La Coruña, Spain
²Department of Electronics and Computation, University of Santiago de Compostela, E-15782 Santiago de Compostela, Spain
³Department of Computer Architecture, University of Málaga, E-29071 Málaga, Spain

Email: margaaml@udc.es

This paper presents a general data-parallel formulation for a class of problems based on the divide and conquer strategy. A combination of three techniques (mapping vectors, index-digit permutations and space-filling curves) is used to reorganize the algorithmic dataflow, providing great flexibility to exploit data locality efficiently and to reduce and optimize communications. In addition, these techniques allow the easy translation of the reorganized dataflows into HPF (High Performance Fortran) constructs. Finally, experimental results on the Cray T3E validate our method.

Received 23 December 1999; revised 5 April 2001

1. INTRODUCTION

The design of compilers for parallel machines that generate parallel programs (semi-)automatically with acceptable performance is a research area of increasing interest. To facilitate the analysis done by the compiler, parallelization environments based on language extensions have been proposed. High Performance Fortran (HPF) [1, 2, 3], based on the data-parallel paradigm, is one of the most significant cases. HPF permits the programmer to specify data distributions, parallel sections, communications/synchronization optimizations and so on. However, writing efficient HPF programs is not necessarily a trivial task. Indeed, the high-level nature of the language often leads to inefficient parallel code.
Additionally, the suitability of HPF for obtaining efficient parallel codes for a large variety of complex applications has not been sufficiently proven. The programmer needs a good understanding of both the application and the HPF execution model in order to map data efficiently onto the processors' memory and to organize communications. To help the programmer in this difficult task, we have developed a data-parallel framework that permits the description, in a uniform, methodical and precise way, of an important class of complex problems based on the divide and conquer (DC) method. Under this framework, the problem may be adapted to the data-parallel programming paradigm in such a way that it can be easily translated into HPF. Our data-parallel formulation of a DC problem requires four phases.

Linearization. The programming model implies the use of linear arrays to represent the data. As our aim is to deal with DC problems, the data is typically already structured either as linear arrays or as trees. In the second case, we may use space-filling curves [4], such as, for instance, Peano–Hilbert (PH) curves.

Data distribution. After linearizing the problem, mapping vectors [5] are used to describe the mapping problem, that is, how data and computations are mapped onto the processors of the parallel computer. In a data-parallel paradigm, only data distributions are considered, as computations are assigned to processors based on the owner-computes rule.

Computation/communication organization. This is one of the key steps in obtaining an efficient parallel implementation of the DC problem. Index-digit operators [6] are used to organize the problem dataflow, expressing computations and communications as operator strings. This operational description of the problem may be used to rearrange its dataflow, with the aim of optimizing data locality (minimizing communications) and workload balancing.
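As a concrete illustration of the linearization phase, the sketch below converts a distance along a Peano–Hilbert curve into coordinates on a 2^order × 2^order grid. It uses the standard bitwise distance-to-coordinate recurrence, offered only as an example of PH linearization; the function name `hilbert_d2xy` is ours and is not taken from [4].

```python
def hilbert_d2xy(order, d):
    """Map distance d along a Peano-Hilbert curve to (x, y) on a
    2**order x 2**order grid. Standard bitwise recurrence; a sketch
    of the idea, not the exact construction used in the paper."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant so sub-curves join up
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx                      # step into the selected quadrant
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Successive distances visit neighbouring grid cells, so tree nodes laid
# out along the curve keep spatially close data in nearby array positions:
# [hilbert_d2xy(1, d) for d in range(4)] -> [(0, 0), (0, 1), (1, 1), (1, 0)]
```

Because consecutive curve positions are always adjacent cells, a block distribution of the resulting linear array tends to keep neighbouring tree nodes on the same processor.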
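The data-distribution phase can be pictured with a toy rendering of the mapping-vector idea: designate some digits of each element's index as processor digits and the rest as local-memory digits. The code below is our own illustration (the function name `owner` and its parameters are assumptions, not the notation of [5]); choosing the most significant bits reproduces an HPF BLOCK distribution, the least significant bits a CYCLIC one.

```python
def owner(i, k, d, proc_digits):
    """Processor owning element i of a 2**k-element array on 2**d processors.
    proc_digits lists which bit positions of the index (0 = least significant)
    act as processor digits; the remaining bits address local memory.
    A sketch of the mapping-vector idea, not the paper's formalism."""
    bits = [(i >> t) & 1 for t in range(k)]
    return sum(bits[t] << j for j, t in enumerate(proc_digits))

# 16 elements on 4 processors:
# BLOCK  (as in !HPF$ DISTRIBUTE A(BLOCK)):  top two index bits pick the processor.
# CYCLIC (as in !HPF$ DISTRIBUTE A(CYCLIC)): bottom two bits pick the processor.
block_owner = owner(5, 4, 2, [2, 3])   # -> 1  (element 5 lives in block 1)
cyclic_owner = owner(5, 4, 2, [0, 1])  # -> 1  (5 mod 4)
```

Describing a distribution by which index digits it consumes is what lets the same formalism express both the data layout and the index-digit communication operators of the next phase.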
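An index-digit operator of the kind used in this phase can be sketched as follows: permuting the base-radix digits of every index induces a permutation of the data sequence itself. The helper below is our own minimal illustration (the name `index_digit_permute` and its interface are assumptions, not the operator notation of [6]); with radix 2 and the digit reversal it yields the familiar bit-reversal reordering.

```python
def index_digit_permute(data, perm, radix=2):
    """Permute data by permuting the base-`radix` digits of each index:
    digit t of the new index is digit perm[t] of the old index (least
    significant digit first). Assumes len(data) is a power of the radix."""
    n = len(data)
    k = 0
    while radix ** k < n:
        k += 1
    assert radix ** k == n, "length must be a power of the radix"
    out = [None] * n
    for i, v in enumerate(data):
        digits = [(i // radix ** t) % radix for t in range(k)]
        j = sum(digits[perm[t]] * radix ** t for t in range(k))
        out[j] = v
    return out

# Reversing the three bits of each index (perm = [2, 1, 0]) gives the
# bit-reversal data permutation used by FFT-like DC algorithms:
# index_digit_permute(list(range(8)), [2, 1, 0]) -> [0, 4, 2, 6, 1, 5, 3, 7]
```

When some index digits are processor digits, a digit permutation that only touches local-memory digits costs no communication, which is precisely the lever used to rearrange the dataflow for locality.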
As data are already arranged as linear arrays (sequences), each data item may be identified by its position (or index) in such a sequence. Hence, index-digit operators are defined as permutations of the digits in the numerical representation of the indices of the data sequence. Such permutations result in permutations of the data sequence itself.

Workload balancing. Finally, in dynamic computations, a workload balancing scheme should be included in the parallel problem.

Figure 1 shows schematically the two key components of the described parallelization framework.

THE COMPUTER JOURNAL, Vol. 44, No. 4, 2001