The Impact of Noise on the Scaling of Collectives: A Theoretical Approach [Extended Abstract]

Saurabh Agarwal, Rahul Garg, and Nisheeth K. Vishnoi

IBM India Research Lab, Block-1, IIT Delhi, Hauz Khas, New Delhi, India 110016.
{saurabh.agarwal,grahul,nvishnoi}@in.ibm.com

Abstract. The performance of parallel applications running on large clusters is known to degrade due to the interference of kernel and daemon activities on individual nodes, often referred to as noise. In this paper, we focus on an important class of parallel applications, which repeatedly perform computation followed by a collective operation such as a barrier. We model this theoretically and demonstrate, in a rigorous way, the effect of noise on the scalability of such applications. We study three natural and important classes of noise distributions: the exponential distribution, the heavy-tailed distribution, and the Bernoulli distribution. We show that the systems scale well in the presence of exponential noise, but that performance degrades drastically in the presence of heavy-tailed or Bernoulli noise.

1 Introduction

Motivation. It is well known that many parallel applications do not scale well on large high-performance computing systems [1–3]. The per-node performance degradation is more pronounced in systems with more than 1K nodes running a multi-tasking operating system such as Unix. In order to build high-performance computing systems that are capable of very high and sustained performance, it is important to understand the reasons for such performance degradation.

It is becoming increasingly evident that one of the main causes of performance degradation is noise in the system, in the form of daemons and interrupts [1, 2]. A detailed study of the noise and its impact on performance was done by Petrini et al. [3] on the 8192-processor ASCI Q machine. It was observed that the overheads due to noise were mostly in the range 0.5% to 2.5% (see Figure 9 in [3]).
However, this noise had a large impact on the system performance. By reducing the intensity of noise in the system, it was possible to get a factor of 13 improvement in the performance of a micro-benchmark that repeatedly calls a barrier with no intervening computation. Similarly, Kramer and Ryan [4] concluded that the performance variability in EP (of the NAS Parallel Benchmarks) was due to noise in the systems.
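The setting described in the abstract (repeated computation followed by a barrier, under exponential, heavy-tailed, or Bernoulli noise) can be explored with a small Monte Carlo simulation. The sketch below is illustrative only and is not the model analyzed in this paper: the function names, the use of a Pareto distribution as the heavy-tailed example, and all parameter values (a mean noise of 1% of the work per phase, Pareto shape 1.5, Bernoulli probability 0.01) are assumptions chosen for the example. It treats each phase as finishing only when the slowest node reaches the barrier.

```python
import random


def exponential_noise(rng, mean=0.01):
    """Exponentially distributed delay with the given mean (assumed value)."""
    return rng.expovariate(1.0 / mean)


def pareto_noise(rng, alpha=1.5, scale=0.01 / 3):
    """Heavy-tailed delay: a scaled Pareto variate (assumed example).

    random.paretovariate(alpha) has mean alpha / (alpha - 1) = 3 for
    alpha = 1.5, so scale = 0.01 / 3 gives a mean delay of 0.01.
    """
    return scale * rng.paretovariate(alpha)


def bernoulli_noise(rng, p=0.01, length=1.0):
    """Delay of fixed length with probability p, otherwise no delay."""
    return length if rng.random() < p else 0.0


def phase_time(num_nodes, work, noise_sampler, rng):
    """One compute-then-barrier phase: the slowest node sets the time."""
    return max(work + noise_sampler(rng) for _ in range(num_nodes))


def mean_slowdown(num_nodes, noise_sampler, phases=200, work=1.0, seed=0):
    """Average phase time divided by the noise-free phase time."""
    rng = random.Random(seed)
    total = sum(
        phase_time(num_nodes, work, noise_sampler, rng) for _ in range(phases)
    )
    return total / (phases * work)


if __name__ == "__main__":
    samplers = {
        "exponential": exponential_noise,
        "pareto": pareto_noise,
        "bernoulli": bernoulli_noise,
    }
    for name, sampler in samplers.items():
        for n in (1, 32, 1024):
            print(f"{name:12s} nodes={n:5d} slowdown={mean_slowdown(n, sampler):.3f}")
```

All three samplers are tuned to the same mean delay, so any difference in slowdown as the node count grows comes from the shape of the distribution rather than from the average amount of noise. The simulation only gives empirical estimates for the assumed parameters and does not replace the analysis that follows.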