Proceedings of the 2001 Winter Simulation Conference
B. A. Peters, J. S. Smith, D. J. Medeiros, and M. W. Rohrer, eds.

BENEFITS FROM SEMI-ASYNCHRONOUS CHECKPOINTING FOR TIME WARP SIMULATIONS OF A LARGE STATE PCS MODEL

Andrea Santoro
Francesco Quaglia

Dipartimento di Informatica e Sistemistica
Università di Roma “La Sapienza”
Via Salaria 113, 00198 Roma, ITALY

ABSTRACT

Checkpointing overhead is a major obstacle for the effectiveness of Time Warp parallel discrete event simulators. Semi-asynchronous checkpointing is a recent solution to tackle this obstacle for Time Warp simulations on distributed memory systems based on Myrinet. In this solution, checkpoint operations are offloaded from the host CPU and are charged to a DMA engine on board of Myrinet network cards. In this paper we report an empirical evaluation of the benefits from semi-asynchronous checkpointing for Time Warp simulations of a large state Personal Communication System (PCS) model. PCS simulation models are typically characterized by high communication locality among the LPs hosted by the same machine, therefore the hardware on board of the Myrinet cards is typically underutilized if used to support exclusively communication. We show that the execution speed of Time Warp simulations of a large state PCS model can be increased when semi-asynchronous checkpointing is adopted.

1 INTRODUCTION

Time Warp parallel discrete event simulators are based on checkpointing and rollback recovery techniques to ensure causally consistent execution of simulation events at each Logical Process (LP) (Jefferson 1985). It is widely recognized that a central factor affecting the performance of this type of simulators is the way in which checkpoint operations are executed.
Commonly, checkpoint operations are charged to the CPU, and the checkpointing overhead is reduced through checkpointing strategies based on infrequent or incremental saving of the LP state vector; see for example (Bauer and Sporrer 1993, Bellenot 1992, Fleischmann and Wilsey 1995, Lin et al. 1993, Quaglia 1999, Quaglia 2001, Ronngren and Ayani 1994, Skold and Ronngren 1996, Steinman 1993, Unger et al. 1993). These solutions pay the price of an increase in the expected rollback latency, since a state to be recovered might not be available, in which case it must be reconstructed during the rollback phase. The “best suited” tradeoff is typically achieved by tuning the parameter(s) of the checkpointing strategy.

A completely different approach to the implementation of checkpoint operations has been recently proposed in (Quaglia and Santoro 2001) for the case of Time Warp simulation on distributed memory systems based on Myrinet. Specifically, the work in (Quaglia and Santoro 2001) presents a Checkpointing and Communication Library (CCL) that exploits the data transfer capabilities of the programmable DMA engines on board of Myrinet network cards to support not only communication but also checkpoint operations. In this way, checkpoint operations are offloaded from the CPU, which can then perform other simulation-specific operations (e.g. event list update, event execution) while checkpointing is in progress. On the other hand, DMA-based checkpointing could suffer from data inconsistency whenever the content of a state buffer is accessed for further modifications while a checkpoint operation involving it has not yet completed. To avoid this, CCL also includes functionalities to suspend the execution of the simulation program on demand in order to wait, if needed, for the completion of a pending DMA-based checkpoint operation. This leads to the so-called semi-asynchronous execution mode of checkpointing.
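
The semi-asynchronous mode described above can be illustrated with a minimal sketch. It is not the CCL interface: the class `SemiAsyncCheckpointer`, its methods `checkpoint` and `wait_pending`, and the function `run_lp` are hypothetical names, and a background thread stands in for the DMA engine on the Myrinet card. The sketch only shows the ordering constraint: a state copy is started without blocking the CPU, and the LP suspends before it modifies the state buffer if that copy is still pending.

```python
import threading


class SemiAsyncCheckpointer:
    """Toy stand-in for DMA-based checkpointing (hypothetical, not the CCL API)."""

    def __init__(self):
        self.log = []          # saved (timestamp, snapshot) pairs
        self._pending = None   # background copy in flight, if any

    def checkpoint(self, timestamp, state):
        """Start copying `state` in the background and return immediately."""
        self.wait_pending()    # assumption: at most one copy in flight

        def copy():
            # The thread plays the role of the DMA engine copying the buffer.
            self.log.append((timestamp, bytes(state)))

        self._pending = threading.Thread(target=copy)
        self._pending.start()

    def wait_pending(self):
        """Block until the in-flight copy, if any, has completed."""
        if self._pending is not None:
            self._pending.join()
            self._pending = None


def run_lp(events, state_size=4):
    """Process (timestamp, index, value) events, checkpointing before each one."""
    state = bytearray(state_size)
    ckpt = SemiAsyncCheckpointer()
    for timestamp, index, value in events:
        ckpt.checkpoint(timestamp, state)  # offloaded copy starts here
        # CPU work that does not touch `state` could overlap with the copy here.
        ckpt.wait_pending()                # suspend before modifying the buffer
        state[index] = value
    ckpt.wait_pending()
    return state, ckpt.log
```

Calling `run_lp([(1, 0, 7), (2, 1, 9)])` returns the final state together with one snapshot per event, each taken before the event's update was applied. Dropping the `wait_pending()` call before the update would reintroduce the data inconsistency the text warns about, since the buffer could change while it is still being copied.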
Preliminary performance results (Quaglia and Santoro 2001, Quaglia, Santoro and Ciciani 2001) have shown that this mode is an effective solution to reduce the completion time of the simulation by reducing the delay associated with any single checkpoint operation.

However, semi-asynchronous checkpointing produces extra-utilization of the hardware on board of the Myrinet network card since that hardware is not used to support communication functionalities alone. Such an extra-utilization might harm the performance of the communication subsystem, thus possibly originating an increase in the amount of rollback (Carothers, Fujimoto and England 1994), which