Fault Tolerant Algorithms for Network-On-Chip Interconnect

M. Pirretti, G. M. Link, R. R. Brooks, N. Vijaykrishnan, M. Kandemir, M. J. Irwin
The Pennsylvania State University
University Park, PA 16802
email: {vijay, pirretti, link}@cse.psu.edu

This paper appeared in the proceedings of ISVLSI, February 2004

Abstract

As technology scales, fault tolerance is becoming a key concern in on-chip communication. Consequently, this work examines fault tolerant communication algorithms for use in the NoC domain. Two different flooding algorithms and a random walk algorithm are investigated. We show that the flood-based fault tolerant algorithms have an exceedingly high communication overhead. We find that the redundant random walk algorithm offers significantly reduced overhead while maintaining useful levels of fault tolerance. We then compare the implementation costs of these algorithms, both in terms of area as well as in energy consumption, and show that the flooding algorithms consume an order of magnitude more energy per message transmitted.¹

Index Terms: Fault Tolerance, Network on Chip, Random Walk, Flooding.

1. Introduction

As technology scales, errors and faults are becoming increasingly common. Crosstalk interferes with signal transmission, while soft errors result in random bit-flips throughout the design [3]. Manufacturing faults can result in entirely non-functional segments of the circuit. The ITRS notes that relaxing the 100% correctness requirement for designs can result in lowered costs of manufacturing, verification, and testing [1]. Designing systems that operate in the presence of transient or permanent failures has been studied for years, but research has focused mostly on large-scale systems and their interconnect. The Network-On-Chip (NoC) design paradigm has been proposed as the future of ASIC design [2]. The NoC design methodology connects different IP blocks by a packet-based on-chip network.
Such a network differs markedly from traditional large-scale interconnects because of the limited area resources available on the chip and the increased dependence on low-latency communication. Implementing traditional fault tolerant algorithms in the NoC domain is infeasible due to these area restrictions, so other techniques must be developed if fault tolerant ASICs are to become possible.

¹ This research is sponsored by the Defense Advanced Research Projects Agency (DARPA), and administered by the Army Research Office under Emergent Surveillance Plexus MURI Award No. DAAD19-01-1-0504. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the sponsoring agencies.

Previous work in this area is extremely limited, as NoC design is still in its infancy. Dumitras et al. [4] have focused on stochastic communication for a number of reasons: it has been shown to cope well with random faults, requires no large lookup tables or reconfiguration, and requires very few retransmissions overall. Dumitras et al. propose a probabilistic flooding scheme based upon well known gossip techniques as a possible fault tolerant solution [6] [7]. Flooding is an effective fault tolerant technique: if a path exists to the destination, a message will almost certainly arrive. In practice, however, this level of fault tolerance may not be necessary. Resilience to a much smaller number of faults still offers increased chip yields, as well as resistance to transient failures, while also reducing unnecessary packet transmissions. Lowering the number of unnecessary packet transmissions allows for higher network throughput and a more efficient use of the interconnect.
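The trade-off between delivery and overhead in a gossip-style flood can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the scheme of Dumitras et al. or the simulator used in this paper: the square mesh topology, the per-link forwarding probability `p`, the fixed number of rounds, and the function name `probabilistic_flood` are all hypothetical choices made for illustration.

```python
import random


def probabilistic_flood(size, src, dst, p, faulty=frozenset(), rounds=None, seed=0):
    """Gossip-style flood on a size x size mesh (illustrative assumptions only).

    In every round, each tile that holds the message sends a copy to each
    mesh neighbor with probability p. Copies sent to faulty tiles are lost.
    Returns (delivered, transmissions).
    """
    rng = random.Random(seed)
    if rounds is None:
        rounds = 2 * size  # hypothetical time-to-live, long enough to cross the mesh
    holders = {src}
    transmissions = 0
    for _ in range(rounds):
        new_holders = set(holders)
        for x, y in sorted(holders):  # sorted for reproducible random draws
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if not (0 <= nx < size and 0 <= ny < size):
                    continue  # no link beyond the mesh edge
                if rng.random() < p:
                    transmissions += 1  # the packet is sent even if it is then lost
                    if (nx, ny) not in faulty:
                        new_holders.add((nx, ny))
        holders = new_holders
    return dst in holders, transmissions


if __name__ == "__main__":
    for p in (0.25, 0.5, 1.0):
        delivered, sent = probabilistic_flood(4, (0, 0), (3, 3), p, faulty={(1, 1)})
        print(f"p={p}: delivered={delivered}, transmissions={sent}")
```

With `p = 1.0` the sketch degenerates to a full flood: the message arrives whenever a fault-free path exists, but every holder retransmits on every link in every round, which is the kind of overhead the paragraph above argues is often unnecessary.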
This paper investigates the previously proposed probabilistic flooding algorithm, as well as two new fault tolerant algorithms, the directed flood and the N-random walk. To more accurately gauge the effectiveness of these algorithms in the NoC design space, we designed the components in Verilog and used synthesis tools to determine the area and energy requirements of the hardware.

The remainder of this paper begins by discussing the algorithms and their implementation. We then compare the performance of the algorithms using a NoC simulator, and show that the directed flood has better fault tolerance and lower overhead than the probabilistic flooding algorithm. The N-random walk algorithm has nearly the same level of fault tolerance, but requires an order of magnitude fewer packet transmissions. The implementation costs of the algorithms are presented, and we find that all three algorithms are similar in terms of area cost, but that the flooding algorithms consume an order of magnitude more energy than the proposed N-random walk algorithm.

2. Fault Tolerant Algorithms

A number of different fault tolerant methods of communication for large-scale systems have been proposed [5] [12]. These algorithms are not amenable to NoC implementation due to significant area and storage overhead. Recently, work has begun on fault tolerant algorithms specifically for the NoC space, with most consisting of various forms of gossip algorithms [4]. We investigate two new fault tolerant algorithms, and compare them to the previ-