Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

Eric Freudenthal and Allan Gottlieb
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
{freudenthal, gottlieb}@nyu.edu

Abstract

Memory system congestion due to serialization of hot spot accesses can adversely affect the performance of interprocess coordination algorithms. Hardware and software techniques have been proposed to reduce this congestion and thereby provide superior system performance. The combining networks of Gottlieb et al. automatically parallelize concurrent hot spot memory accesses, improving the performance of algorithms that poll a small number of shared variables.

We begin by debunking one of the performance claims made for the NYU Ultracomputer. Specifically, a gap in its simulation coverage hid a design flaw in the combining switches that seriously impacts the performance of busy-wait polling in centralized coordination algorithms. We then debug the system by correcting the design and closing the simulation gap, after which we are able to duplicate the original claims of excellent performance on busy-wait polling. Specifically, our simulations show that, with the revised design, the Ultracomputer readers-writers and barrier algorithms achieve performance comparable to the highly regarded MCS algorithms.

1. Introduction

It is well known that the scalability of interprocess coordination can limit the performance of shared-memory computers. Since the latency required for coordination algorithms such as barriers or readers-writers increases with the available parallelism, their impact is especially important for large-scale systems. A common software technique used to minimize this effect is distributed local-spinning, in which processors repeatedly access variables stored locally (in so-called NUMA systems, the shared memory is physically distributed among the processors).
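The local-spinning idea can be sketched as a toy single-use barrier in which each arriving participant polls only its own flag, so no single shared location becomes a hot spot. This is an illustrative sketch, not any algorithm from the paper: `local_spin_barrier` and its flag array are invented names, and Python threads stand in for processors (a real NUMA implementation would place each flag in memory local to its processor).

```python
import threading

def local_spin_barrier(n_threads):
    """Toy single-use barrier illustrating local spinning.

    Each thread spins only on flags[tid], its own flag, rather than
    all threads polling one shared counter (the hot-spot pattern).
    """
    flags = [False] * n_threads      # one flag per participant
    state = {"count": 0}             # arrival count, guarded by lock
    lock = threading.Lock()

    def wait(tid):
        with lock:
            state["count"] += 1
            last = state["count"] == n_threads
        if last:
            # Last arrival releases every waiter via its private flag.
            for i in range(n_threads):
                flags[i] = True
        else:
            # "Local spin": only thread tid ever reads flags[tid].
            while not flags[tid]:
                pass

    return wait
```

Under this scheme the release phase touches each flag once, instead of every processor repeatedly reading (and contending for) the same word.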
A less common technique is to utilize special-purpose coordination hardware, such as the barrier network of [1], the CM5 Control Network [2], or the NYU “combining network” [8], and have the processors reference centralized memory. The idea behind the combining network is that when references to the same memory location meet at a network switch, they are combined into one reference that proceeds to memory. When the response to the combined message reaches the switch, data held in the “wait buffer” is used to generate the needed second response. Other approaches to combining have been pursued as well; see for example [23] and [13].

The early work at NYU on combining networks showed their great advantage for certain classes of memory traffic, especially those with a significant portion of hot-spot accesses (a disproportionately large percentage of the references directed to one or a few locations). It is perhaps surprising that this work did not simulate the traffic generated when all the processors engage in busy-wait polling, i.e., 100% hot-spot accesses (but see the comments on [14] in Section 3).

When completing studies begun a number of years ago of what we expected to be very fast centralized algorithms for barriers and readers-writers, we were particularly surprised to find that the combining network performed poorly in this situation. While it did not exhibit the disastrous serialization characteristic of accesses to a single