Cluster Communication Protocols for Parallel-Programming Systems

KEES VERSTOEP, RAOUL A. F. BHOEDJANG, TIM RÜHL, HENRI E. BAL, and RUTGER F. H. HOFMAN
Vrije Universiteit

Clusters of workstations are a popular platform for high-performance computing. For many parallel applications, efficient use of a fast interconnection network is essential for good performance. Several modern System Area Networks include programmable network interfaces that can be tailored to perform protocol tasks that otherwise would need to be done by the host processors. Finding the right trade-off between protocol processing at the host and the network interface is difficult in general. In this work, we systematically evaluate the performance of different implementations of a single, user-level communication interface. The implementations make different architectural assumptions about the reliability of the network and the capabilities of the network interface. The implementations differ accordingly in their division of protocol tasks between host software, network-interface firmware, and network hardware. Also, we investigate the effects of alternative data-transfer methods and multicast implementations, and we evaluate the influence of packet size. Using microbenchmarks, parallel-programming systems, and parallel applications, we assess the performance of the different implementations at multiple levels. We use two hardware platforms with different performance characteristics to validate our conclusions. We show how moving protocol tasks to a relatively slow network interface can yield both performance advantages and disadvantages, depending on specific characteristics of the application and the underlying parallel-programming system.
Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design—network communications; C.4 [Performance of Systems]: design studies; performance attributes; D.1.3 [Programming Techniques]: Concurrent Programming—parallel programming

General Terms: Performance, Design, Experimentation

Additional Key Words and Phrases: Clusters, parallel-programming systems, system area networks

Part of this research was performed while R. A. F. Bhoedjang was at Cornell University.
Authors' address: Vrije Universiteit, Faculty of Sciences, Department of Computer Science, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands.
Authors' email addresses: versto@cs.vu.nl; raoul@holmes.nl; t.ruhl@datadistilleries.com; bal@cs.vu.nl; rutger@cs.vu.nl.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or permissions@acm.org.
© 2004 ACM 0734-2071/04/0800-0281 $5.00
ACM Transactions on Computer Systems, Vol. 22, No. 3, August 2004, Pages 281–325.

1. INTRODUCTION

Modern custom network hardware allows latencies of only a few microseconds and throughputs of over a Gigabit per second, but such performance is rarely