A Variable Group Block Distribution Strategy for Dense Factorizations on Networks of Heterogeneous Computers

Alexey Lastovetsky, Ravi Reddy

Department of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland
{alexey.lastovetsky, manumachu.reddy}@ucd.ie

Abstract. In this paper, we present a static data distribution strategy called Variable Group Block distribution to optimize the execution of factorization of a dense matrix on a network of heterogeneous computers. The distribution is based on a functional performance model of computers, which tries to capture different aspects of heterogeneity of the computers, including the (multi-level) memory structure and paging effects.

1 Introduction

The paper presents a static data distribution strategy called Variable Group Block distribution to optimize the execution of factorization of a dense matrix on a network of heterogeneous computers. The Variable Group Block distribution strategy is a modification of the Group Block distribution strategy, which was proposed in [1] for 1D parallel Cholesky factorization, developed into a more general 2D distribution strategy in [2], and applied to 1D LU factorization in [3], [4].

The Group Block distribution strategy is based on a performance model that represents the speed of each processor by a constant positive number; computations are distributed amongst the processors such that the volume assigned to each processor is proportional to its speed. However, the single-number model is efficient only if the relative speeds of the processors involved in the execution of the application are a constant function of the size of the problem and can be approximated by a single number. This is true mainly for homogeneous distributed memory systems where:

– The processors have almost the same memory size at each level of their memory hierarchies, and
– Each computational task assigned to a processor fits in its main memory.
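As an illustration of the single-number model only, the following minimal Python sketch distributes n equal units of computation in proportion to constant processor speeds. It is not the Group Block algorithm of [1]-[4]; the function name `distribute_proportionally`, the example speeds, and the largest-remainder rounding rule are assumptions made for this sketch.

```python
import math


def distribute_proportionally(n, speeds):
    """Split n equal work units so that each processor's share is
    proportional to its constant speed (hypothetical helper).

    Rounding uses the largest-remainder rule, an assumption of this
    sketch rather than a rule taken from the paper.
    """
    if n < 0 or not speeds or any(s <= 0 for s in speeds):
        raise ValueError("n must be >= 0 and all speeds must be positive")

    total = sum(speeds)
    # Ideal (fractional) share of each processor.
    shares = [n * s / total for s in speeds]
    # Start from the integer part of each share.
    alloc = [math.floor(x) for x in shares]
    # Hand the remaining units to the largest fractional parts.
    leftover = n - sum(alloc)
    order = sorted(range(len(speeds)),
                   key=lambda i: shares[i] - alloc[i],
                   reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Example: three processors with assumed relative speeds 1 : 2 : 2.
allocation = distribute_proportionally(100, [1, 2, 2])
```

Because each speed is a single constant, the resulting shares keep the same ratios for every problem size; the model has no way to express a processor whose speed drops once its task stops fitting in main memory.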
But the model becomes inefficient in the following cases:

– The processors have significantly different memory structures, with different sizes of memory at each level of the memory hierarchy. Therefore, beginning from some problem size, the same task will still fit into the main memory of some processors and stop fitting into the main memory of others, causing paging and a visible degradation of the speed of these processors. This means that the relative speeds of the processors will start changing significantly in favor of the non-paging processors as soon as the problem size exceeds the critical value.