A Study of Data Sets and Affinity in the Perfect Club

Eduard Ayguadé, Jesús Labarta, Jordi Garcia, Mercè Girones and Mateo Valero
Computer Architecture Department, Polytechnic University of Catalunya
Gran Capita s/n, Modul D6, 08071 - Barcelona (Spain)

Abstract

Most of the approaches for automatic data partitioning are based on the Component Affinity Graph (CAG). The CAG stores the relevant information about preferences and conflicts in the alignment step of the data partitioning process. We have taken measurements to estimate the computational power required to process the CAG and the chances of finding regular data distributions among processors. Three optimizations have been performed to increase the number of affinity relations obtained by current approaches: expression substitution, subscript substitution, and induction variable detection. A significant number of new affinity relations is obtained when these optimizations are applied.

1 Introduction

Data distribution is the key point in a restructuring environment for Massively Parallel Machines, where each processor is assumed to have direct access to a local (or close) memory and indirect access to the remote memories of other processors. Remote access can be implemented by message passing and is therefore more time-consuming. Any processor can send data to and receive data from other processors in the system through a set of communication primitives.

Several approaches have been presented in the literature to compile sequential programs so that they exploit parallelism when executed on distributed-memory multiprocessors. The proposed methods can be classified into two groups:

• methods based on a user-provided specification of the data decomposition: the compiler takes care of generating each node program, including all data movement primitives required to access non-local data according to the data decomposition specified by the user.
• methods based on the automatic generation of data decompositions, using information provided by dependence and data access analysis of the sequential source program.

We are interested in the second group of methods. Our data distribution tool tackles the problem as an automatic source-to-source restructuring process whose target is sequential code that includes directives specifying how data is decomposed among the local memories of the node processors. Most of the approaches for automatic data partitioning [LiCh90, Gupt92, KCKC93, ...] are based on the Component Affinity Graph (CAG). The CAG stores the relevant information about preferences and conflicts in the alignment step of the data partitioning process. The main purpose of our study is to have information about CAG complexities and reference patterns that are