Dealing with Transient Faults in the Interconnection Network of CMPs at the Cache Coherence Level

Ricardo Fernández-Pascual, José M. García, Member, IEEE, Manuel E. Acacio, and José Duato, Member, IEEE

Abstract—The importance of transient faults is predicted to grow due to current technology trends of increased scale of integration. One of the components that will be significantly affected by transient faults is the interconnection network of chip multiprocessors (CMPs). To deal with these faults efficiently, and unlike other authors, we propose to use fault-tolerant cache coherence protocols that ensure the correct execution of programs when not all messages are correctly delivered. We describe the extensions made to a directory-based cache coherence protocol to provide fault tolerance, and we provide a modified set of token counting rules which are useful for designing fault-tolerant token-based cache coherence protocols. We compare the directory-based fault-tolerant protocol with a token-based fault-tolerant one. We also show how to adjust the fault tolerance parameters to achieve the desired level of fault tolerance, and we measure the overhead incurred to support very high fault rates. Simulation results using a set of scientific, multimedia, and commercial applications show that the fault tolerance measures have virtually no impact on execution time with respect to a non-fault-tolerant protocol. Additionally, our protocols can support very high rates of transient faults at the cost of slightly increased network traffic.

Index Terms—fault tolerance, cache coherence, transient faults, interconnection network.

1 INTRODUCTION

Chip multiprocessors (CMPs) have become the preferred way to effectively take advantage of the increased availability of transistors while keeping design complexity manageable.
Further, tiled architectures, which are built by replicating several tiles each comprising a core, a private cache, part of the shared cache, and a network interface, help keep complexity manageable, scale well to larger numbers of cores, and support families of products with varying numbers of tiles. It therefore seems likely that they will be the choice for future many-core CMP designs [23], [24]. Fig. 1b shows a 16-core CMP organized by replicating the tile structure shown in Fig. 1a.

A main drawback of current technology trends is that, due to miniaturization and the lower voltages used for power efficiency, the susceptibility of future chips to transient faults will increase. Transient faults [3], [17], also known as soft errors or single event upsets, occur when a component produces an erroneous output but continues working correctly after the event. Any event that upsets the stored or communicated charge can cause soft errors. Typical causes include alpha-particle strikes, cosmic rays, radiation from radioactive atoms, which exist in trace amounts in all materials, and electrical sources like power supply noise, electromagnetic interference (EMI), or radiation from lightning.

Reliability is not only required for some critical applications: even for commodity systems, reliability needs to be above a certain level for the system to be useful at all. In fact, since the number of components in a chip increases and the reliability of each component decreases, it is no longer economical to design and test new chips assuming a worst-case reliability scenario. Instead, new designs will target the common case and assume a certain rate of transient faults. Hence, transient faults will have to be handled across all the levels of the system to avoid actual errors.

Transient faults are already a problem for memories and caches, which routinely use error detection and correction codes (ECC) to deal with them.
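As a rough illustration of how an error-correcting code masks a single-bit upset, the following sketch implements the classic Hamming(7,4) code in Python. It is not taken from this paper and is not the code used in any particular memory or cache (real designs typically protect much wider words); the function names `encode`, `decode`, and `flip` are hypothetical helpers chosen for this example.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit; return (data bits, error position)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


def flip(codeword, position):
    """Model a transient fault: invert the bit at a 1-based position."""
    faulty = list(codeword)
    faulty[position - 1] ^= 1
    return faulty


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    stored = encode(word)
    upset = flip(stored, 5)           # a soft error hits bit 5
    recovered, where = decode(upset)  # the decoder locates and repairs it
    assert recovered == word and where == 5
```

A code of this kind corrects any single flipped bit per codeword, at the cost of three check bits for every four data bits. Two simultaneous flips in the same codeword would be miscorrected, which is why practical schemes usually add an extra parity bit to detect double errors.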
Other parts of the system will need to use fault tolerance techniques to deal with transient faults as their frequency increases.

One of the components which will be affected by transient faults in a CMP is the interconnection network. It occupies a significant part of the chip real estate and is critical to the performance of the system. It handles the communication between the cores and caches, which is done by means of a cache coherence protocol. Communication is usually very fine-grained (at the level of cache lines) and requires very small and frequent messages. Hence, to achieve good performance, the interconnection network must provide very low latency.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 21, NO. 8, AUGUST 2010

R. Fernández-Pascual, J.M. García, and M.E. Acacio are with the Departamento de Ingeniería de Computadores, Facultad de Informática, Universidad de Murcia, Campus de Espinardo, Murcia 30100, Spain. E-mail: {rfernandez, jmgarcia, meacacio}@ditec.um.es.
J. Duato is with the Departamento de Informática de Sistemas y Computadores, Universidad Politécnica de Valencia, Camino de Vera, s/n Valencia 46022, Spain. E-mail: jduato@disca.upv.es.
Manuscript received 16 Mar. 2009; revised 3 Aug. 2009; accepted 14 Aug. 2009; published online 28 Aug. 2009.
Recommended for acceptance by R. Bianchini.
For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number TPDS-2009-03-0120.
Digital Object Identifier no. 10.1109/TPDS.2009.148.
1045-9219/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

Fault tolerance in the interconnection network has traditionally been provided at the network level. Several proposals on how to do this are mentioned in Section 2. Ensuring the reliable transmission of all messages through the network imposes significant overheads in latency, power consumption, and area. Differently from other authors, we propose to deal with transient faults in the