Non-threaded and Threaded Approaches to MultiRail Communication with uDAPL

Jie Cai #, Alistair P. Rendell *, Peter E. Strazdins +
School of Computer Science, the Australian National University
Canberra ACT 0200 Australia
# Jie.Cai@anu.edu.au
* Alistair.Rendell@anu.edu.au
+ Peter.Strazdins@anu.edu.au

Abstract

uDAPL is a portable and platform-independent communication library that provides RDMA as well as send/recv operations. Several well-known software packages have attempted to take advantage of uDAPL’s portability, such as Open MPI, MVAPICH2, Intel MPI, and Cluster OpenMP. However, limited network bandwidth can still be a bottleneck for applications using this software. Using a “multirail” network is one way to bypass this limitation. In this paper, we design a non-threaded and a threaded approach to improve the performance of uDAPL over multirail-configured clusters. The two approaches are evaluated on an InfiniBand cluster with different multirail configurations. The results show that the threaded approach improves uni-directional bandwidth by 33% and 148% on the multi-port and the multi-HCA configured networks respectively, and the non-threaded approach improves uni-directional bandwidth by 90% on the multi-HCA configured network. Similar improvements are achieved for bi-directional bandwidth.

1. Introduction

The user-level direct access programming library (uDAPL), defined by the DAT Collaborative, attempts to provide a network-, architecture- and operating-system-independent interface to applications for remote direct memory access (RDMA) communications [1]. This potentially allows applications to use different networks as the underlying transport with minimal effort. Several well-known communication libraries and applications now try to take advantage of uDAPL’s portability, such as MVAPICH2 [2], [3], Intel MPI Library [4], Open MPI [5], and Cluster OpenMP [6], [4].
InfiniBand is an emerging networking technology supporting low latency, high bandwidth and RDMA communications. At the end of 2008, 36% of the top 50 supercomputers listed in TOP500 [7] were using InfiniBand as their interconnection network. However, even with InfiniBand, network bandwidth can still become the performance bottleneck for some applications, especially on clusters built with multi-core machines [8]. A well-known method to overcome the bandwidth limitation is to use “multirail networks” [9], [8], [10]. However, none of those attempts solved the problem with a generic, portable solution; they targeted either the socket application programming interface (API) or the InfiniBand Verb API. Furthermore, those attempts used only a single process to handle communications over both rails.

Hence a question arises: “Can significant performance improvement be achieved through a portable and platform independent communication library?” Because some InfiniBand network interface cards, known as host channel adapters (HCAs), provide dual ports, a multirail network can be configured in several different ways. This raises a second question: “What is the best way to pursue the performance improvement on different configured multirail networks?”

In this paper, we answer both questions. To explore them, we design a set of uDAPL multirail benchmarks using two different approaches, threaded and non-threaded. The two approaches are evaluated on an InfiniBand cluster with different multirail configurations.

The rest of this paper is organized in five sections. Sections 2 and 3 introduce background knowledge of uDAPL and multirail networks. Section 4 discusses and illustrates the design issues of the multirail uDAPL benchmarks for the two approaches. The benchmarks are evaluated on an InfiniBand cluster in section 5, followed by conclusions and future work in section 6.

2. uDAPL

The Direct Access Programming Library (DAPL) was released by the DAT Collaborative [11] as an attempt to provide a portable set of APIs for all RDMA networks. DAPL comprises a kernel-space and a user-space specification in the C programming language, named kDAPL and uDAPL respectively. User programs use uDAPL for generic RDMA applications [1]. Abstract concepts are represented as objects in the uDAPL specifications. For example, an Interface Adapter (IA) is the