Comparing the Performance of Centralized and Distributed Coordination on Systems with Improved Combining Switches

NYU Computer Science Technical Report TR2003-849

Eric Freudenthal and Allan Gottlieb
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
{freudenthal, gottlieb}@nyu.edu

Abstract— Memory system congestion due to serialization of hot-spot accesses can adversely affect the performance of interprocess coordination algorithms. Hardware and software techniques have been proposed to reduce this congestion and thereby provide superior system performance. The combining networks of Gottlieb et al. automatically parallelize concurrent hot-spot memory accesses, improving the performance of algorithms that poll a small number of shared variables. The widely used "MCS" distributed-spin algorithms take a software approach: they reduce hot-spot congestion by polling only variables stored locally. Our investigations detected performance problems in existing designs for combining networks, and we propose mechanisms that alleviate them. Simulation studies described herein indicate that centralized readers-writers algorithms executed on the improved combining networks have performance at least competitive with the MCS algorithms.

I. INTRODUCTION

It is well known that the scalability of interprocess coordination can limit the performance of shared-memory computers. Since the latency required for coordination algorithms such as barriers or readers-writers increases with the available parallelism, their impact is especially important for large-scale systems. A common software technique used to minimize this effect is distributed local spinning, in which processors repeatedly access variables stored locally (in so-called NUMA systems, the shared memory is physically distributed among the processors).
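The local-spinning idea can be made concrete with a small sketch of an MCS-style queue lock (the "MCS" algorithms referred to above are due to Mellor-Crummey and Scott). This is an illustrative Python model, not the authors' code: a small mutex stands in for the atomic swap and compare-and-swap primitives that a real implementation would use, and `time.sleep(0)` yields the interpreter during the busy-wait. The point to note is that each waiter polls only its own node's flag, never a shared hot-spot variable.

```python
import threading
import time

class MCSNode:
    """Per-thread queue node; each waiter spins only on its own `locked` flag."""
    __slots__ = ("locked", "next")
    def __init__(self):
        self.locked = False
        self.next = None

class MCSLock:
    def __init__(self):
        self._tail = None
        self._meta = threading.Lock()  # stand-in for hardware swap/CAS

    def _swap_tail(self, node):
        with self._meta:
            prev, self._tail = self._tail, node
            return prev

    def _cas_tail(self, expected, new):
        with self._meta:
            if self._tail is expected:
                self._tail = new
                return True
            return False

    def acquire(self, node):
        node.next = None
        pred = self._swap_tail(node)          # atomically append ourselves to the queue
        if pred is not None:
            node.locked = True                # must be set before linking, to avoid a lost wakeup
            pred.next = node
            while node.locked:                # local spin: poll our own flag only
                time.sleep(0)                 # yield the interpreter in this sketch

    def release(self, node):
        if node.next is None:
            # No visible successor: try to reset the queue to empty.
            if self._cas_tail(node, None):
                return
            while node.next is None:          # a successor is mid-enqueue; wait for the link
                time.sleep(0)
        node.next.locked = False              # hand the lock to the successor
```

In a NUMA setting each `MCSNode` would live in its owner's local memory, so the spin in `acquire` generates no network traffic at all, in contrast to centralized busy-wait polling of a single lock word.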
A less common technique is to utilize special-purpose coordination hardware, such as the barrier network of [2], the CM5 Control Network [3], or the "combining network" of [1], and have the processors reference centralized memory. The idea behind combining is that when references to the same memory location meet at a network switch, they are combined into one reference that proceeds to memory. When the response to the combined message reaches the switch, data held in the switch's "wait buffer" is used to generate the needed second response.

The early work at NYU on combining networks showed their great advantage for certain classes of memory traffic, especially those with a significant portion of hot-spot accesses (a disproportionately large percentage of the references directed to one or a few locations). It is perhaps surprising that this work did not simulate the traffic generated when all the processors engage in busy-wait polling, i.e., 100% hot-spot accesses (but see the comments on [7] in Section III). When completing studies begun a number of years ago of what we expected to be very fast centralized algorithms for barriers and readers-writers, we were particularly surprised to find that the combining network performed poorly in this situation. While it did not exhibit the dis-