Designing Computational Clusters for Performance and Power

Kirk W. Cameron, Rong Ge, Xizhou Feng

Abstract

Power consumption in computational clusters has reached critical levels. High-end cluster performance improves exponentially, while the power consumed and heat dissipated increase operational costs and failure rates. Yet the demand for more powerful machines continues to grow. In this chapter, we motivate the need to reconsider the traditional performance-at-any-cost approach to cluster design. We propose designs in which power and performance are treated as critical constraints, and we describe power-aware and low-power techniques that reduce the power profiles of parallel applications while mitigating the impact on performance.

Introduction

High-end computing systems are a crucial source of scientific discovery and technological revolution. The unmatched computational capability provided by high-end computers enables scientists to solve challenging problems that are intractable by traditional means and to make breakthroughs in a wide spectrum of fields such as nanoscience, fusion, climate modeling, and astrophysics [40, 63].

The designed peak performance of high-end computing systems has increased rapidly over the last two decades. For example, the peak performance of the No. 1 supercomputer in 1993 was below 100 GFlops; this value increased 2800-fold within 13 years, reaching 280 TFlops in 2006 [65]. Two factors primarily contribute to this increase in peak performance. The first is rising microprocessor speed: the operating frequency of a microprocessor nearly doubled every two years in the 1990s [10]. The second is the growing size of high-end computers: the No. 1 supercomputer in the 1990s consisted of about 1,000 processors, while today's No. 1 supercomputer, BlueGene/L, is about 130 times larger, consisting of 131,072 processors [1].

Meanwhile, there is an increasing gap between achieved "sustained" performance and designed peak performance. Empirical data indicate that the sustained performance achieved by typical scientific applications is about 10-15% of peak performance. Gordon Bell prize-winning applications [2, 59, 61] sustain 35% to 65% of peak, and achieving such performance requires the efforts of a team of experts working collaboratively for years. LINPACK [25], arguably the most scalable and highly optimized benchmark code, has averaged about 67% of designed peak performance on TOP500 machines over the past decade [24].