Application-Specific Fault Tolerance via Data Access Characterization

Nawab Ali¹, Sriram Krishnamoorthy¹, Niranjan Govind¹, Karol Kowalski¹, and Ponnuswamy Sadayappan²

¹ Pacific Northwest National Laboratory, Richland, WA
{nawab.ali,sriram,niri.govind,karol.kowalski}@pnl.gov
² The Ohio State University, Columbus, OH
saday@cse.ohio-state.edu

Abstract. Recent trends in semiconductor technology and supercomputer design predict an increasing probability of faults during an application's execution. Designing an application that is resilient to system failures requires careful evaluation of the impact of various approaches on preserving key application state. In this paper, we present our experiences in an ongoing effort to make a large computational chemistry application fault tolerant. We construct the data access signatures of key application modules to evaluate alternative fault tolerance approaches. We present the instrumentation methodology, characterization of the application modules, and evaluation of fault tolerance techniques using the information collected. The application signatures developed capture application characteristics not traditionally revealed by performance tools. We believe these can be used in the design and evaluation of runtimes beyond fault tolerance.

Keywords: Fault tolerance, Data access characterization, NWChem.

1 Introduction

The increasing component counts in modern supercomputer designs, coupled with a decrease in micro-architectural feature size, and considerations of power envelope predict a significant decrease in the mean time between failures (MTBF) of the next generation of leadership-class machines [27]. Long-running scientific applications should expect multiple failures, both hard and transient, during execution. This has increased the need for applications to incorporate capabilities to identify and make forward progress in the presence of faults.

Making a large-scale scientific application fault tolerant is an arduous task. The first step involves evaluating different fault tolerance approaches and quantifying their impact in terms of space and time overhead, the amount of work lost in the event of a fault, and the feasibility of incorporating the fault tolerance approaches into the application. In this paper, we present our approach to evaluating key modules of NWChem [32,17], a large computational chemistry application consisting of close to two million lines of code. NWChem is a widely used computational chemistry suite shown to scale on the largest systems. Understanding the key characteristics of such a large application through study of the source code is a daunting task. This has long been recognized by performance tools

E. Jeannot, R. Namyst, and J. Roman (Eds.): Euro-Par 2011, LNCS 6853, Part II, pp. 340–352, 2011. © Springer-Verlag Berlin Heidelberg 2011