An Experimental Study on How to Build Efficient Multi-Core Clusters for High Performance Computing

Luiz Carlos Pinto, Luiz H. B. Tomazella, M. A. R. Dantas
Distributed Systems Research Laboratory (LaPeSD)
Department of Informatics and Statistics (INE)
Federal University of Santa Catarina (UFSC)
{luigi, tomazella, mario}@inf.ufsc.br

Abstract

Multi-core technology creates a new scenario for communicating processes in an MPI cluster environment, so the trade-offs involved need to be uncovered. This motivation guided our research and led to a new approach for setting up more efficient clusters built with commodity components. As an alternative to non-commodity interconnects such as Myrinet and Infiniband, we propose leaving some cores idle with respect to application processing, in order to build higher-performance clusters of commodities that are also more economically accessible. Executing the fine-grained IS (Integer Sort) algorithm from the NAS Parallel Benchmarks revealed a speedup of up to 25%. Interestingly, a cluster organized according to the proposed setup was able to outperform a single multi-core SMP host in which all processes communicate inside the host. Empirical results therefore indicate that our proposal is successful for medium- and fine-grained algorithms.

1. Introduction

Scientific applications used to be executed mainly on expensive, proprietary massively parallel processing (MPP) machines. As processing power and communication speed increasingly become off-the-shelf products, clusters of commodities [27] have been taking a large share of the high performance computing (HPC) world [5]. Not long ago, clusters were typically formed by aggregating identical single-processor computing nodes, a design also known as a NoW (Network of Workstations). Such a parallel architecture demands a distributed-memory programming interface such as MPI [1] for inter-process communication.
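One way to picture the setup proposed in the abstract, leaving cores idle with respect to application processing, is as a rank placement that deliberately leaves some cores on every node unused. The sketch below is a minimal illustration under assumed parameters: `idle_core_layout`, `cores_per_node`, and `idle_per_node` are hypothetical names, not taken from this study. Ranks fill each node in contiguous blocks, so idling cores spreads the same number of MPI processes over more nodes, and fewer processes share each node's memory subsystem and network interface.

```python
def idle_core_layout(num_procs: int, cores_per_node: int, idle_per_node: int) -> list[list[int]]:
    """Assign MPI ranks to nodes in contiguous blocks, leaving
    `idle_per_node` cores on every node free of application processes.

    Returns one list of ranks per node (hypothetical helper for illustration).
    """
    busy = cores_per_node - idle_per_node  # cores per node that run ranks
    if busy < 1:
        raise ValueError("each node must keep at least one busy core")
    # Start a new node every `busy` ranks; the last node may be partially filled.
    return [list(range(first, min(first + busy, num_procs)))
            for first in range(0, num_procs, busy)]


if __name__ == "__main__":
    # 8 ranks on quad-core nodes: all cores busy vs. half of the cores idle.
    packed = idle_core_layout(8, cores_per_node=4, idle_per_node=0)
    spread = idle_core_layout(8, cores_per_node=4, idle_per_node=2)
    print(len(packed), packed)  # 2 [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(len(spread), spread)  # 4 [[0, 1], [2, 3], [4, 5], [6, 7]]
```

In practice, MPI launchers expose this kind of placement through host files or process-mapping options; their exact syntax depends on the MPI implementation and is not shown here.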
Because each computing node has its own memory subsystem and its own path to the interconnect fabric, each MPI process executes roughly independently of the others. Nowadays, multi-processor (SMP) and, more recently, multi-core (CMP) technologies are increasingly finding their way into cluster computing. Inevitably, clusters built from SMP and CMP-SMP nodes will become more and more common. Lacking a widely accepted term for CMP-SMP cluster design, we refer to both architectures as CLUMPS, a usual term for a cluster of SMP nodes.

Traditional MPI programs follow the SPMD (Single Program, Multiple Data) parallel programming model, which was designed basically for cluster architectures built from nodes with a single processing unit, that is, single-core nodes. In a modern cluster built from multi-core multi-processor nodes, by contrast, access to the interconnect fabric is shared by the processes executing locally. Main memory accesses and the usually deeper cache hierarchy of CMP-SMP nodes might also slow down inter-process communication, since each node's bus and memory subsystem are shared. Thus, in such a modern cluster, the cost of moving data between communicating cores depends not only on their physical distance (inside a processor socket, inside a node, or across nodes) but also on shared memory and network bandwidth limitations.

Our motivation lies in the importance of realizing, from the point of view of an architectural designer, that modern multi-core cluster designs create a different scenario for predicting performance. The pressing need to understand the trade-offs between these cluster designs guided our research and finally led to a new approach for setting up more efficient clusters of commodities.
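To make the three distance classes concrete, the following sketch counts, for an assumed all-to-all communication pattern, how many rank pairs fall into each class under two placements, and how many ranks share one node. The `Location` model, the function names, and the node layout of 2 sockets with 2 cores each are illustrative assumptions, not the hardware or code used in this study; bandwidth contention itself is not modeled, only the path each message would take.

```python
from collections import Counter
from itertools import combinations
from typing import NamedTuple


class Location(NamedTuple):
    """Physical position of one MPI rank (hypothetical topology model)."""
    node: int
    socket: int


def path_class(a: Location, b: Location) -> str:
    """Classify the path between two ranks by physical distance."""
    if a.node != b.node:
        return "inter-node"    # traffic crosses the interconnect fabric
    if a.socket != b.socket:
        return "intra-node"    # same host, different processor sockets
    return "intra-socket"      # cores inside the same processor package


def all_to_all_paths(placement: list[Location]) -> Counter:
    """Count communicating rank pairs per path class in an all-to-all exchange."""
    return Counter(path_class(a, b) for a, b in combinations(placement, 2))


def max_ranks_per_node(placement: list[Location]) -> int:
    """Largest number of ranks sharing one node's bus, memory and network interface."""
    return max(Counter(loc.node for loc in placement).values())


if __name__ == "__main__":
    # 8 ranks on nodes with 2 sockets x 2 cores each.
    packed = [Location(r // 4, (r % 4) // 2) for r in range(8)]  # 2 nodes, all cores busy
    spread = [Location(r // 2, r % 2) for r in range(8)]          # 4 nodes, 1 core per socket
    for name, placement in (("packed", packed), ("spread", spread)):
        paths = all_to_all_paths(placement)
        print(name, max_ranks_per_node(placement),
              paths["intra-socket"], paths["intra-node"], paths["inter-node"])
    # packed 4 4 8 16
    # spread 2 0 4 24
```

In this example, spreading the ranks halves the number of processes contending for each node's resources but turns more pairs into inter-node traffic; which effect dominates depends on the application's granularity and the interconnect, which is exactly the kind of trade-off that needs to be measured.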
Thus, an alternative approach to the utilization of non-commodity interconnects, such as Myrinet and Infiniband, is proposed in order to build economically more accessible clusters of commodities with higher performance.

2008 11th IEEE International Conference on Computational Science and Engineering 978-0-7695-3193-9/08 $25.00 © 2008 IEEE DOI 10.1109/CSE.2008.63 33