Providing Fault Tolerance to InfiniBand Networks

J. M. Montañana, J. Flich, A. Robles, P. López, and J. Duato
Dept. of Computer Engineering (DISCA), Universidad Politécnica de Valencia
Camino de Vera, 14, 46021 Valencia, Spain
E-mail: jmontana@gap.upv.es

Abstract: Currently, clusters of PCs are considered a cost-effective alternative to large parallel computers. As the number of elements in these systems increases, the probability of faults increases dramatically. Therefore, it is critical to keep the system running even in the presence of faults. The interconnection network plays a key role in system performance. InfiniBand (IBA) is a new standard interconnect suitable for clusters. Most of the fault-tolerant routing strategies proposed for massively parallel computers cannot be applied to IBA because routing and virtual channel transitions are deterministic, which prevents packets from avoiding the faults. A possible approach to providing fault tolerance in IBA consists of using several disjoint paths between every source-destination pair of nodes and selecting the appropriate path at the source host. To this end, however, a routing algorithm is required that provides enough disjoint paths while still guaranteeing deadlock freedom. In this paper we address this issue, proposing a simple and effective fault-tolerant methodology for IBA networks that can be applied to any network topology and that balances the degree of fault tolerance against the number of network resources devoted to it. Preliminary results show that the proposed methodology scales well and supports up to three faults in 2D tori and five faults in 3D tori using only two virtual channels.

I. Introduction

In recent years, there has been a trend toward using clusters of PCs to build large systems. Clusters of PCs are also being considered as a cost-effective alternative for small- and large-scale parallel computing.
With each edition, more cluster-based systems are included in the Top500 list of supercomputers. In particular, Virginia Tech's X [17] (with 2,200 processors) occupies the third position in the list.

InfiniBand [6] is a standard interconnect technology for connecting processor nodes and I/O nodes to build a system area network (SAN). The InfiniBand Architecture (IBA) is designed around a switch-based interconnect technology with high-speed serial point-to-point links connecting multiple independent and clustered hosts and I/O devices. Therefore, this interconnect technology is suitable for building large clusters.

In many cluster-based systems, it is critical to keep the system running even in the presence of faults. These systems use a very large number of components. Each individual component can fail, and thus the probability of failure of the entire system increases. Although switches and links are robust, they work close to their technological limits: increasing the clock frequency leads to higher power dissipation, and the resulting heat could lead to premature faults. As a result, fault-tolerant mechanisms in cluster-based systems are becoming a key issue.

This work was supported by the Spanish MCYT under Grant TIC2003-08154-C06-01, the Generalitat Valenciana under Grant CTIDIB/2002/288, and the JCC de Castilla-La Mancha under Grant PBC-02-008.

Most of the fault-tolerant routing strategies proposed in the literature for massively parallel computers are not suitable for clusters (see Chapter 6 in [5] for a description of some of the most interesting approaches). This is because they often require hardware support that is not provided by current commercial interconnect technologies [1], [6]. Additionally, these routing strategies have normally been designed for specific regular network topologies, such as meshes and tori. However, the switch interconnection pattern in clusters may be irregular.
Furthermore, they cannot be applied to IBA because routing is deterministic, which prevents packets from circumventing the faulty components found along their paths. Also, some of these routing strategies need to perform virtual channel transitions when a packet is blocked due to a fault. However, virtual channels in IBA cannot be selected at routing time. Additionally, to the best of our knowledge, there are no proposals focused on providing fault tolerance in IBA.

In IBA, routing and virtual channel selection are performed based on the destination local ID (DLID) and the service level (SL) fields of the packet header. These two fields are computed at the source node and do not change along the path. Therefore, IBA routing is a kind of source routing in which the routing information is distributed. As a consequence, a possible way to provide fault tolerance in IBA would be to have several alternative paths between every source-destination pair and to select one of them at the source host. In fact, IBA provides a hardware-supported mechanism [6], referred to as Automatic Path Migration (APM), which may be used for this purpose. According to this mechanism, at connection setup time the source node is given two sets of path information for each destination, one for the primary path and another for the alternate path. APM provides a fast mechanism for migrating from the primary to the alternate path when a faulty component is detected in the network. Once path migration is accomplished, the alternate path becomes the new primary path. Therefore, the subnet ma-