Multi-core Aware Optimization for MPI Collectives

Bibo Tu 1,2, Ming Zou 1,2, Jianfeng Zhan 1, Xiaofang Zhao 1 and Jianping Fan 1
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
2 Graduate University of Chinese Academy of Sciences, Beijing 100049
{tbb, zm, jfzhan}@ncic.ac.cn, {zhaoxf, fan}@ict.ac.cn

Abstract—MPI collective operations on multi-core clusters should be multi-core aware. In this paper, collective algorithms with a hierarchical virtual topology address the performance differences among the communication levels of a multi-core cluster, here simplified to intra-node and inter-node communication. Furthermore, selecting suitable segment sizes for intra-node collective communication caters to the cache hierarchy of multi-core processors. Built on the existing collective algorithms in MPICH2, these two techniques form a portable optimization methodology over MPICH2 for collective operations on multi-core clusters. Following this methodology, a multi-core aware broadcast algorithm has been implemented and evaluated as a case study. The performance results show that the multi-core aware optimization methodology over MPICH2 is efficient.

I. INTRODUCTION

In the Top500 supercomputer list published in November 2007, multi-core clusters were the most popular platforms in parallel computing [1]. Intuitively, multi-core processors can speed up applications by dividing the workload among cores. In practice, however, applications on multi-core clusters often fall short of optimal performance. Applications tend to treat multi-core processors, also called chip multiprocessors (CMPs), simply as conventional symmetric multiprocessors (SMPs). Yet chip multiprocessors with shared cache offer capabilities fundamentally different from SMPs, which share only main memory. Accordingly, applications should be multi-core aware.
For many scientific applications, the communication cost dominates overall execution time, so communication middleware (e.g. MPI) should also be multi-core aware. MPI has emerged as one of the primary programming paradigms for writing efficient parallel applications. It provides a plethora of communication primitives geared towards point-to-point and collective communication. Compared to point-to-point performance, collective operation performance is often overlooked; however, a profiling study [2] showed that some applications spend more than 80% of their transfer time in collective operations. Improving the performance of collective operations is therefore key to enabling high parallel speed-ups.

Significant research has been carried out on improving collective communication. Some work focused on developing optimized algorithms for particular architectures, such as hypercube, mesh, torus or fat tree, with an emphasis on minimizing link contention, node contention, or the distance between communicating nodes [3, 4]. Collective algorithms that are automatically tuned to different conditions (message size, number of processes) have also been developed [5, 6]. Other work focused on exploiting the hierarchy of parallel computer networks to optimize collective performance. One direction optimizes MPI collectives for wide-area distributed environments (e.g. MagPIe) [7, 8], where the goal is to minimize communication over slow wide-area links at the expense of more communication over faster local-area connections. Another direction optimizes MPI collectives for LAN environments, targeting SMP clusters by modifying the MPICH ADI layer, e.g. MPI-StarT [9]. Using shared memory to implement collective communication has likewise been a well studied problem.
Some propose using remote memory operations across the cluster and shared memory within each node to build efficient collective operations for clusters of SMPs [10, 11]. With help from the operating system and network drivers, minimizing memory-copy cost has also recently been applied to intra-node communication on multi-core clusters [12]. Several of these MPI-level algorithms have been implemented in MPICH2 [13]. However, the collective algorithms currently employed do not perform optimally on the new multi-core clusters.

The optimal implementation of a collective for a given system depends mainly on the virtual topology (e.g. flat tree, binary tree, binomial tree, etc.) and the message size. The memory hierarchy of multi-core clusters is more complicated, so exploiting the hierarchy of communication levels (typically three levels: intra-CMP, inter-CMP and inter-node; simplified here to two: intra-node and inter-node), i.e. topology-aware collective algorithms inspired by MagPIe and MPI-StarT, will be essential. Furthermore, selecting suitable segment sizes for intra-node collective communication caters to the cache hierarchy of multi-core processors (e.g. private L1 cache and shared L2 cache on Intel multi-core platforms). The former exploits the performance difference (communication hierarchy) between communication levels; the latter tries to improve the cache hit ratio of intra-node collective communication. Together they form a portable optimization methodology over MPICH2 for collective operations on multi-core clusters. As a case study, a multi-core aware broadcast algorithm has been implemented and evaluated following this methodology.
978-1-4244-2640-9/08/$25.00 ©2008 IEEE
Accepted as a poster presentation, 2008 IEEE International Conference on Cluster Computing