High-Throughput Coherence Controllers

Ashwini K. Nanda, Anthony-Trung Nguyen, Maged M. Michael, and Douglas J. Joseph

IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY 10598
{ashwini,michael,djoseph}@watson.ibm.com

University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, IL 61801
anguyen@cs.uiuc.edu

Abstract

Recent research shows that the occupancy of coherence controllers is a major performance bottleneck for distributed cache-coherent shared-memory multiprocessors. In this paper we study three approaches to alleviating this problem in hardwired coherence controllers: multiple protocol engines, pipelined protocol engines, and split request-response streams.

Split request-response streams is an innovative contribution of this paper. The performance of pipelining in the context of coherence controllers has not been presented in the literature. Multiple protocol engines have not been studied in the context of hardwired controllers except in one of our earlier studies, and only to a limited extent.

Using both commercial and scientific benchmarks on detailed simulation models, we present experimental results showing that each mechanism is highly effective, reducing controller occupancy by as much as 66% and improving execution time by as much as 51% for applications with high communication bandwidth requirements. A combination of mechanisms further reduces controller occupancy and execution time by as much as 78% and 61%, respectively.

Our results show that applying any of these parallel mechanisms in the coherence controllers allows integrating four times as many processors per coherence controller, thus reducing system cost, while maintaining or even exceeding the performance of systems with a larger number of coherence controllers.
1 Introduction

Previous research has shown that scalable shared-memory performance can be achieved on directory-based cache-coherent multiprocessors such as the Stanford FLASH [3] and the MIT Alewife [1] machines. A key component of this type of machine is the coherence controller on each node, which provides cache-coherent access to memory that is distributed among the nodes of the multiprocessor. Recent research results [4, 11] show that the occupancy of the coherence controller (CC) can be the performance bottleneck for applications with high communication requirements. (This work was performed at the IBM Thomas J. Watson Research Center as part of the HighT project.)

Motivated by these results, we study three approaches to alleviating this problem: multiple protocol engines (PEs), split request-response streams, and pipelined protocol engines. A coherence controller with multiple PEs can handle multiple coherence transactions in parallel. Each PE can have two protocol handling units operating in parallel, one handling the request stream and the other handling the response stream. Finally, the protocol handling unit can be pipelined to overlap queue arbitrations, directory accesses, protocol processing, and state maintenance. The increased parallelism and overlap of coherence operations offered by these approaches serve to decrease the overall occupancy of coherence controllers.

Multiple PEs were employed in the Sun S3.mp [15] architecture. Its coherence controller dedicates one protocol engine to handling transactions to local addresses and one to remote addresses. The architects of the Sequent STiNG [8] system also considered a similar approach as one of the ways to reduce controller occupancy. However, the impact of this approach on the performance of these systems was not studied.
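The occupancy benefit of these mechanisms can be illustrated with a toy queueing model. The following Python sketch is our own illustration, not taken from the paper: all names (`Transaction`, `single_engine`, `split_streams`) are hypothetical, and the model abstracts protocol processing as a fixed per-transaction service time. It contrasts a single serializing protocol engine with split request-response handling units that drain the two streams independently.

```python
# Toy occupancy model (illustrative only; names and parameters are assumptions,
# not the paper's simulator). Transactions occupy an engine for `service`
# cycles; a split controller serves requests and responses concurrently.
from dataclasses import dataclass

@dataclass
class Transaction:
    arrival: int      # cycle at which the transaction reaches the controller
    service: int      # cycles of protocol processing (engine occupancy)
    is_request: bool  # True: request stream, False: response stream

def makespan(stream):
    """Serve a stream FIFO on one handling unit; return the finish cycle."""
    clock = 0
    for t in sorted(stream, key=lambda t: t.arrival):
        clock = max(clock, t.arrival) + t.service
    return clock

def single_engine(txns):
    """One protocol engine serializes requests and responses."""
    return makespan(txns)

def split_streams(txns):
    """Separate request/response handling units work in parallel."""
    requests = [t for t in txns if t.is_request]
    responses = [t for t in txns if not t.is_request]
    return max(makespan(requests), makespan(responses))

if __name__ == "__main__":
    txns = [Transaction(0, 10, True), Transaction(0, 10, False),
            Transaction(5, 10, True), Transaction(5, 10, False)]
    print(single_engine(txns), split_streams(txns))
```

Under this model, a bursty mix of requests and responses finishes in half the time once the streams are split; the same framing extends to multiple PEs (more parallel servers) and to pipelining (shorter effective service time per stage).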
Michael et al. [11, 10] evaluate systems with one local and one remote PE in the context of comparing the performance of hardwired and programmable coherence controllers. Their results show significant improvements in the performance of such systems over systems with single protocol-engine controllers. They also find load imbalance of coherence traffic at the protocol engines, indicating the potential for further improvement if this issue is resolved.