A Routing Methodology for Achieving Fault Tolerance in Direct Networks

María Engracia Gómez, Member, IEEE, Nils Agne Nordbotten, José Flich, Pedro López, Member, IEEE Computer Society, Antonio Robles, Member, IEEE Computer Society, José Duato, Member, IEEE, Tor Skeie, and Olav Lysne, Member, IEEE

Abstract—Massively parallel computing systems are being built with thousands of nodes. The interconnection network plays a key role in the performance of such systems. However, the high number of components significantly increases the probability of failure. Additionally, failures in the interconnection network may isolate a large fraction of the machine. It is therefore critical to provide an efficient fault-tolerant mechanism to keep the system running, even in the presence of faults. This paper presents a new fault-tolerant routing methodology that does not degrade performance in the absence of faults and tolerates a reasonably large number of faults without disabling any healthy node. In order to avoid faults, for some source-destination pairs, packets are first sent to an intermediate node and then from this node to the destination node. Fully adaptive routing is used along both subpaths. The methodology assumes a static fault model and the use of a checkpoint/restart mechanism. However, there are scenarios where the faults cannot be avoided solely by using an intermediate node. Thus, we also provide some extensions to the methodology. Specifically, we propose disabling adaptive routing and/or using misrouting on a per-packet basis. We also propose the use of more than one intermediate node for some paths. The proposed fault-tolerant routing methodology is extensively evaluated in terms of fault tolerance, complexity, and performance.

Index Terms—Fault tolerance, direct networks, adaptive routing, virtual channels, bubble flow control.
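The intermediate-node idea can be illustrated with a small sketch (our illustration, not the paper's implementation: it assumes a 2D mesh with minimal fully adaptive routing, where a minimal path may use any link inside the source-destination bounding rectangle; all function names are ours):

```python
from itertools import product

def bounding_box_links(src, dst):
    """All links inside the bounding rectangle of src and dst in a 2D mesh.
    With minimal fully adaptive routing, any of these links may be used."""
    (x1, y1), (x2, y2) = src, dst
    xs = range(min(x1, x2), max(x1, x2) + 1)
    ys = range(min(y1, y2), max(y1, y2) + 1)
    links = set()
    for x, y in product(xs, ys):
        if x + 1 <= max(x1, x2):
            links.add(frozenset([(x, y), (x + 1, y)]))  # horizontal link
        if y + 1 <= max(y1, y2):
            links.add(frozenset([(x, y), (x, y + 1)]))  # vertical link
    return links

def find_intermediate(src, dst, faulty_links, mesh_dims):
    """Return an intermediate node I such that the minimal adaptive
    subpaths src->I and I->dst cannot traverse any faulty link.
    Returns src itself if direct minimal paths already avoid the faults,
    or None if no single intermediate node suffices."""
    faulty = {frozenset(l) for l in faulty_links}
    if not (bounding_box_links(src, dst) & faulty):
        return src  # direct minimal paths already avoid all faults
    for i in product(range(mesh_dims[0]), range(mesh_dims[1])):
        if i in (src, dst):
            continue
        if not (bounding_box_links(src, i) & faulty) and \
           not (bounding_box_links(i, dst) & faulty):
            return i
    return None  # would need misrouting or a second intermediate node
```

For example, with a fault on link ((1,1),(2,1)) in a 4x4 mesh, source (0,0) and destination (3,3) can route via an intermediate node such as (0,2). When the source and destination lie in the same row as the fault, no single intermediate node suffices, which motivates the extensions (misrouting, multiple intermediate nodes) described above.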
1 INTRODUCTION

There exist many compute-intensive applications that require a huge amount of processing power (nuclear weapons simulations, protein folding, global climate modeling, galaxy interaction simulations, etc.). These applications require continued research and technology development to deliver computers with steadily increasing computing power. The required levels of computing power can only be achieved with massively parallel computers, such as the Earth Simulator [19], the ASCI Red [1], and the BlueGene/L [5].

The huge number of processors and associated devices (memories, switches, links, etc.) significantly increases the probability of failure. Each individual component can fail and, thus, the probability of failure of the entire system increases dramatically. One of the JASON Defense Advisory Panel reports from 2003, about the requirements for ASCI, states that "Scaling to PetaFlop using present machine architectures implies very large number of processors—of order 100,000, perhaps—might be needed. Such large numbers raises serious questions of scalability of code performance and of machine reliability." Thus, in these systems, it is critical to keep the system running, even in the presence of failures. In addition, failures in the interconnection network may isolate a large fraction of the machine, containing many healthy processors that otherwise could have been used.

Although network components, like switches and links, are robust, they are working close to their technological limits and, therefore, they are prone to failures. Increasing clock frequencies leads to higher power dissipation, which in turn could lead to premature failures. Therefore, fault-tolerant mechanisms for interconnection networks are becoming a critical design issue for large massively parallel computers [25], [47], [26], [48], [37], [38]. Faults can be classified as transient or permanent.
Transient faults are usually handled by communication protocols, using CRCs to detect faults and retransmitting packets. In order to deal with permanent faults in a system, two fault models can be used: static or dynamic. In a static fault model, it is assumed that all the faults are known in advance, when the machine is (re)booted. In order to implement it, once a fault is detected, all the processes in the system are halted, the network is emptied, and a management application is run in order to deal with the faulty component. The management application detects where the fault is, computes the information required by the nodes in order to tolerate the fault, and distributes this information. Then, the system is rebooted and the processes are resumed. This fault model needs to be combined with checkpointing techniques in order to be effective. Applying checkpointing minimizes the fault's impact on applications because they are restarted from the latest checkpoint. In a dynamic fault model, once a new fault is found, actions are

400 IEEE TRANSACTIONS ON COMPUTERS, VOL. 55, NO. 4, APRIL 2006

. M.E. Gómez, J. Flich, P. López, A. Robles, and J. Duato are with the Department of Computer Engineering, Universidad Politécnica de Valencia, Camino de Vera, 14, 46071-Valencia, Spain. E-mail: {megomez, jflich, plopez, arobles, jduato}@disca.upv.es.
. N.A. Nordbotten, T. Skeie, and O. Lysne are with the Simula Research Laboratory, PO Box 134, N-1325, Lysaker, Norway. E-mail: {nilsno, tskeie, olavly}@simula.no.

The first two authors are listed in alphabetical order.
Manuscript received 7 Feb. 2005; revised 5 Aug. 2005; accepted 5 Oct. 2005; published online 22 Feb. 2006.
For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number TC-0037-0205.
0018-9340/06/$20.00 © 2006 IEEE. Published by the IEEE Computer Society.
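The checkpoint/restart behavior assumed by the static fault model can be illustrated with a small, self-contained simulation (our sketch, not the authors' implementation; the reboot and network reconfiguration steps are abstracted away):

```python
def run_with_checkpointing(total_steps, fault_steps, checkpoint_every=10):
    """Simulate checkpoint/restart under a static fault model: when a
    fault is detected, the system is halted and rebooted (reconfiguration
    is not modeled), and the application resumes from the latest
    checkpoint instead of from step 0. Returns the total number of
    executed steps, including the redone ones."""
    pending_faults = set(fault_steps)  # steps at which a fault is detected
    checkpoint = 0                     # last checkpointed step
    step = 0                           # application progress so far
    work_done = 0                      # total steps executed, incl. redone work
    while step < total_steps:
        if step in pending_faults:
            pending_faults.discard(step)  # static model: fault fixed at reboot
            step = checkpoint             # resume from the latest checkpoint
            continue
        step += 1
        work_done += 1
        if step % checkpoint_every == 0:
            checkpoint = step             # periodic checkpoint
    return work_done
```

With total_steps=100 and a fault at step 25, only steps 21-25 are redone (5 extra steps of work), whereas without any checkpoint the first 25 steps would all be lost and repeated.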