Adapting Dynamic Core Coupling to a direct-network environment

Daniel Sánchez, Juan L. Aragón and José M. García 1

Abstract — To benefit from the increasing transistor count in current processors, designs are leading to CMPs that will integrate tens or hundreds of processor cores on-chip. However, scaling and voltage factors are increasing the susceptibility of these architectures to transient, intermittent and permanent faults, as well as to process variations. A very recent solution found in the literature is Dynamic Core Coupling (DCC) [1]. DCC provides a fault tolerant framework based on the dynamic binding of cores for re-execution. This technique relies on the use of a shared bus. However, for current and future CMP architectures, the more efficient designs are tiled CMPs, which are organized around a direct network, since area, scalability and power constraints make the use of a bus as the interconnection network impractical. In this work, we present the changes needed in the original DCC proposal for it to be used in a direct-network environment. These changes mostly concern the replacement of the bus with a mesh as the interconnection network, the coherence protocol and the consistency window. Our evaluations show that, for several parallel scientific applications, the performance overhead in this new environment rises to 10%, 19%, 42.5% and 47% for 4, 8, 16 and 32 core pairs, respectively, compared to the 5% performance degradation previously reported for 8 core pairs in the original DCC proposal.

I. Introduction

Nowadays, market trends are positioning CMPs as the best way to exploit the large number of transistors that can be accommodated on a chip. However, as the number of transistors per chip rises, the failure rate increases with every new process generation. On the one hand, the larger number of transistors on a chip raises the probability of a fault occurring.
On the other hand, higher temperatures and lower supply voltages make the chip more susceptible to transient faults. A transient fault is a flip of one or more bits. It may be caused by the impact of an alpha particle on the chip, or by causes such as power supply noise and signal crosstalk. All zones of the chip are vulnerable to this kind of fault; therefore, fault tolerance mechanisms must be designed to avoid incorrect program executions. Moreover, these techniques always have both a hardware cost, because of the additional hardware required to re-execute instructions, and a performance cost, because of the actions needed to assure a correct execution.

The family of techniques SRT [2], SRTR [3], CRT [4] and CRTR [5] is based on a previous proposal called AR-SMT [6], in which redundant threads execute the same instructions in an SMT processor with a performance degradation between 10% and 30%. In all these studies, fault tolerance is achieved by redundant execution in two different execution cores (or threads) called leading/master and trailing/slave. The master core runs some instructions ahead of the slave, and they communicate with each other through dedicated structures, such as the LVQ, StB or RVQ. Although applicable to sequential programs, these techniques are not directly valid for executing parallel programs due to incoherences in memory values called input incoherences [7].

An input incoherence is a phenomenon that occurs when two corresponding dynamic loads do not obtain the same value from memory. This problem is very common in parallel programs, where the redundant core executes the same instruction a few cycles later.

1 Dpto. de Ingeniería y Tecnología de Computadores, Univ. de Murcia, e-mail: {dsanchez, jlaragon, jmgarcia}@ditec.um.es
Reunion [7] addresses this problem with a new paradigm called relaxed input replication, in which the master issues non-coherent accesses to memory (phantom requests) while the slave core issues real coherent accesses. If a difference caused by an input incoherence is detected, it is marked as a transient fault when indeed it is not.

In order to avoid the need for intermediate structures to communicate the leading and trailing cores, another option is the periodic creation of checkpoints. To detect any fault between two checkpoints, the master and the slave exchange a signature, or a hash summarizing the current state, detecting a fault if they differ. The recovery mechanism is as simple as rolling back to the last successfully verified checkpoint, which establishes a safe point. A very recent study in this fashion is DCC, by LaFrieda et al. [1]. DCC is a promising approach to achieving fault tolerance in multiprocessors, based on a shared bus.

In the present work, we analyse and evaluate how DCC behaves in a scalable tiled-CMP architecture. We show that the DCC execution time overhead is more noticeable than previously reported when considering direct networks. We have evaluated in detail the scalability, the influence of cache associativity, the delay of L1 replacements and the total network traffic. We have found that the main cause of this increased execution time overhead is the mechanism used by DCC to assure memory consistency between the master and slave cores.

The rest of the document is organized as follows: Section II reviews how DCC operates and points out its major weaknesses. Section III presents how to migrate DCC to work on a direct network instead of a shared bus. Section IV introduces the methodology employed in the evaluation. Section V shows the performance results. Section VI summarizes the

Castellón, Septiembre 2008. Actas de las XIX Jornadas de Paralelismo, pp. 253-258, 2008. ISBN: 978-84-8021-676-0