Performance and Timing Measurements in a Multi-core Beowulf Cluster Compute-Node

Damian Valles, David H. Williams, and Patricia A. Nava
Electrical & Computer Engineering, The University of Texas at El Paso, El Paso, Texas, United States
{dvalles, williams, pnava}@ece.utep.edu

Abstract [1]: This paper studies the timing and performance of a single compute-node in a small Beowulf cluster. The processing architecture in each compute-node consists of two Intel Quad-core Xeon processors, for a total of eight cores. The purpose of this work was to investigate the limitations and bottlenecks that occur during compute-intensive tasks. The timing and performance of two cores in a local processor were compared to that of all eight cores in both processors. High Performance Linpack (HPL) was utilized to measure timing and performance at different problem, block, and grid sizes. The results showed that performance was limited for small block sizes but increased substantially as the block size increased to the width of the FSB and MCH data bus. As the block size increases further, performance decreases because it takes more time to process each block, decreasing core utilization.

Keywords: cluster, tuning, quad-core, benchmark

1. Introduction

Beowulf clusters have become an economical approach to meeting the high computational needs of engineering and scientific applications. When a Multi-core Architecture (MCA) environment is introduced to a Beowulf cluster, its effects can change the system from a homogeneous to a heterogeneous computing environment.

[1] This material is based upon work supported by the National Science Foundation under Grant No. CNS-0709438. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Network switches were provided by Cisco Systems, Inc.
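The sweep over problem, block, and grid sizes described in the abstract corresponds to the N, NB, and P x Q parameters of HPL's input file, HPL.dat. The fragment below is an illustrative sketch of such a file for an eight-core node; the specific values shown (N = 20000, NB = 128, a 2 x 4 process grid) are assumptions chosen for illustration, not the parameters used in this study.

```
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout, 7=stderr, file)
1            # of problem sizes (N)
20000        Ns
1            # of block sizes (NB)
128          NBs
0            PMAP process mapping (0=Row-major)
1            # of process grids (P x Q)
2            Ps
4            Qs   (P x Q = 8 MPI processes, one per core)
```

With this grid, launching HPL as, e.g., `mpirun -np 8 xhpl` spawns one MPI process per core of the eight-core compute-node, and varying the Ns and NBs entries reproduces the kind of parameter sweep performed in this work.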
A heterogeneous MCA can be defined as one in which the cores differ in size, in operating frequency, and in the complexity of each core's individual architecture design. Consequently, introducing MCA to a High-Performance Computing (HPC) cluster raises several factors that can affect the internal scalability of multiple cores, as well as the environment of the cluster as it changes from homogeneous to heterogeneous. These include the system compiler, memory, I/O, front-side bus (FSB), chipset, etc. [1]. Therefore, it is essential to have a good understanding of the system architecture in order to maximize performance when distributing tasks.

Virgo 2 is a Beowulf cluster in which each compute-node contains dual Quad-core CPUs, for a total of eight cores per node. The two processors communicate with each other and with the rest of the system through their respective FSBs to the Memory Controller Hub (MCH). The focus of this work is to analyze the architecture of this environment in order to find potential performance bottlenecks that might be present during execution of jobs within the cores of each compute-node.

Benchmarking is necessary to measure the timing and performance of the system in order to understand the architecture. For our work, High Performance Linpack (HPL) 2.0 was utilized to benchmark the performance of a single compute-node and to spawn a variable number of processes to each core. For the subprogram support of HPL, the Automatically