Comparison of MPI Benchmark Programs on an SGI Altix ccNUMA Shared Memory Machine

Nor Asilah Wati Abdul Hamid, Paul Coddington and Francis Vaughan
School of Computer Science, University of Adelaide
Adelaide, SA 5005, Australia
{asilah,paulc,francis}@cs.adelaide.edu.au

Abstract

The results produced by five different MPI benchmark programs on an SGI Altix 3700 are analyzed and compared. There are significant differences in the results for some MPI operations. We investigate the reasons for these discrepancies, which are due to differences in the measurement techniques, implementation details and default configurations of the different benchmarks. The variation in results on the Altix is generally much greater than on a distributed memory machine, due primarily to the ccNUMA architecture and the importance of cache effects, as well as some implementation details of the SGI MPI libraries.

1. Introduction

Several benchmark programs have been developed to measure the performance of MPI on parallel computers. These programs were primarily designed for, and have mostly been used on, distributed memory machines. However, it is interesting to measure MPI performance on shared memory machines such as the SGI Altix, which has become a popular system for high-performance computing. The hierarchical non-uniform memory architecture (NUMA) that is typical of large shared memory machines means that analysis of the performance of shared memory machines is likely to be more complex than for distributed memory machines, which are typically clusters with a fairly uniform communications architecture.

In some recent work, we measured the MPI performance of the SGI Altix 3700 using MPIBench [6], a recently-developed MPI benchmark program. In order to check the results, and in particular to investigate anomalies in some results, we did similar measurements using other MPI benchmarks, which are reported here.
We found that for some MPI routines, the results for different MPI benchmark programs were significantly different, and showed a greater variation than might have been expected, given that our experience with similar comparisons on distributed memory machines showed much smaller variation in the results from different benchmark programs. In this paper we compare the results of different MPI benchmarks on the SGI Altix, and investigate the reasons for these variations, which are due to differences in the measurement techniques, implementation details and default configurations of the different benchmarks. Many of these differences concern cache effects and communication patterns, which are much more important on shared memory machines with a cache-coherent NUMA architecture. However, some of the effects appear to be artifacts of the implementation details of the SGI MPI libraries.

2. MPI Benchmark Software

There are several different MPI benchmark programs that are in common use. They typically measure the average times to complete a selection of MPI routines for different data sizes on a specified number of CPUs, using the following basic approach:

    loop over different MPI routines
        loop over different message sizes
            get start time
            loop over number of repetitions
                if this is a collective communication routine,
                    do a barrier synchronization
                call the MPI routine
            end loop over repetitions
            get finish time
            average time = (finish time - start time) / number of repetitions
        end loop over message sizes
    end loop over MPI routines

Most benchmarks use the standard MPI timer MPI_Wtime, and get accurate results by making many repetitions of the measurements in order to compute the average value. Most benchmarks have a fixed number of message sizes (at least by default), but some also provide
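The basic measurement approach described above can be sketched in a few lines of code. The following is a simplified Python illustration of the timing pattern only; it is not taken from any of the benchmarks discussed here, and the function names (benchmark, fake_send) and the stand-in routine are our own inventions, substituting a buffer allocation for an actual MPI call:

```python
import time

def benchmark(routine, message_sizes, repetitions, barrier=None):
    """Return the average time per call of `routine` for each message size.

    Mirrors the loop structure used by typical MPI benchmarks: time a
    block of `repetitions` calls and divide by the repetition count.
    If `barrier` is given (as for collective routines), it is called
    before each invocation to synchronize the participants.
    """
    results = {}
    for size in message_sizes:
        start = time.perf_counter()   # benchmarks would use MPI_Wtime here
        for _ in range(repetitions):
            if barrier is not None:
                barrier()             # synchronize before a collective call
            routine(size)
        finish = time.perf_counter()
        results[size] = (finish - start) / repetitions
    return results

# Hypothetical stand-in for an MPI routine: allocate a buffer of `size` bytes.
def fake_send(size):
    bytearray(size)

times = benchmark(fake_send, [1024, 65536], repetitions=100)
for size, t in times.items():
    print(f"{size} bytes: {t * 1e6:.2f} us per call")
```

Note that averaging over a single timed block, rather than timing each call individually, keeps the timer overhead out of the per-call result, which matters when individual calls take only microseconds.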