Scalable Verification of MPI Programs

Anh Vo and Ganesh Gopalakrishnan
School of Computing, University of Utah, Salt Lake City, UT
{avo,ganesh}@cs.utah.edu

Abstract

Large message passing programs today are deployed on clusters with hundreds, if not thousands, of processors. Any programming bug in such programs is hard to reproduce and debug, and greatly hurts productivity. Although many tools aim to help developers debug MPI programs, most of them fail to catch bugs caused by non-determinism in MPI codes. In this work, we propose a distributed, scalable framework that can explore all relevant schedules of MPI programs to check for deadlocks, resource leaks, local assertion errors, and other common MPI bugs.

1. Author Info

Author: Anh Vo
Advisor: Ganesh Gopalakrishnan
Number of years in PhD program: 3

2. Introduction

The Message Passing Interface (MPI) [7] library remains one of the most widely used APIs for implementing distributed message passing programs. Its projected use in critical future applications such as Petascale computing [6] makes it imperative that MPI programs be free of programming logic bugs. This is a challenging task given the size and complexity of optimized MPI programs. In particular, performance optimizations often introduce several kinds of nondeterminism into the code. For example, the call MPI_Recv(MPI_ANY_SOURCE, MPI_ANY_TAG), which can potentially match a message from any sender in the same communication group (we will later refer to this as a wildcard receive), is often used to hand new work to the first sender that finishes its previous item of work. A more general version of this construct is the MPI_Waitsome call, which waits for a subset of the previously issued communication requests to finish. These nondeterministic constructs can result in MPI program bugs that manifest intermittently, the bane of debugging.
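To make the danger concrete, consider the classic wildcard-receive scenario: rank 0 posts a wildcard receive followed by a receive directed at rank 1, while ranks 1 and 2 each send one message to rank 0. The sketch below is a toy Python model of the matching semantics (it does not use MPI itself, and the function names are ours), showing that one runtime match choice completes while the other deadlocks:

```python
# Toy model (not real MPI) of a wildcard-receive hazard:
# P0 posts MPI_Recv(MPI_ANY_SOURCE) followed by MPI_Recv(source=1),
# while P1 and P2 each send one message to P0.  Which sender the
# wildcard matches is the MPI runtime's choice.

def run_schedule(wildcard_match):
    """Simulate P0's two receives given the sender matched by the wildcard."""
    pending_sends = {1, 2}            # ranks with an outstanding send to P0
    # First receive: the wildcard matches `wildcard_match` and consumes it.
    pending_sends.remove(wildcard_match)
    # Second receive: MPI_Recv(source=1) needs an unconsumed send from rank 1.
    if 1 in pending_sends:
        pending_sends.remove(1)
        return "ok"
    return "deadlock"                 # no matching send: P0 blocks forever

if __name__ == "__main__":
    for match in (1, 2):
        print(f"wildcard matches rank {match}: {run_schedule(match)}")
    # wildcard matches rank 1: deadlock
    # wildcard matches rank 2: ok
```

Because only one of the two match choices deadlocks, a test run that happens to take the benign choice reports success, which is precisely why such bugs surface intermittently.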
Traditional MPI debugging tools such as Marmot [10] insert delays during repeated testing under the same input to perturb the MPI runtime scheduling. Experience indicates that this technique is often unreliable [1]. In order to detect all scheduling-related bugs, the framework under which MPI programs are debugged needs the ability to determine and enforce all relevant schedules (the concept of relevant schedules is explained in 4.1.2). ISP (In-Situ Partial Order) [17], [18], [20], [21], the current state-of-the-art dynamic verifier for MPI programs, is at present the only known tool with this ability.

However, the current framework of ISP does not scale well on large clusters, the environment in which many large MPI applications are deployed. In addition, some bugs only manifest when the application scales beyond a certain threshold, and some bugs only manifest in a distributed environment. The next-generation framework must therefore possess the following abilities: (i) detect and enforce all relevant schedules of MPI programs, (ii) work in distributed settings, and (iii) scale well. To this end, we have designed DMA (Distributed Message Passing Analyzer), a scalable distributed framework that satisfies all of the above requirements. In the rest of this paper, we provide an overview of the framework (figure 1 shows the proposed design of DMA), as well as the status of the work.

3. Related Work

In recent years, considerable effort has been spent on building efficient verification tools for MPI programs, such as [10], [12], [19]. However, none of those tools offers the three basic abilities that we discussed earlier (the ability to detect and enforce relevant interleavings, distributed operation, and scalability). For example, tools such as Marmot have been shown in our experiments to miss very simple deadlock scenarios, as shown in [1].
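In contrast to delay-based perturbation, enforcing every possible wildcard match choice covers all relevant schedules and finds such deadlocks deterministically. The following is a minimal Python sketch of this idea, a toy model in the spirit of ISP's dynamic exploration rather than its actual implementation (all names are ours):

```python
from itertools import product

def p0_program(matches):
    """Toy stand-in for the program under test: P0 does
    Recv(ANY_SOURCE); Recv(source=1) while ranks 1 and 2 each send once.
    `matches` gives the sender chosen for each wildcard receive."""
    pending = {1, 2}
    pending.remove(matches[0])        # the wildcard consumes this send
    if 1 not in pending:
        return "deadlock"             # Recv(source=1) can never complete
    return "ok"

def explore(program, wildcard_choices):
    """Re-run `program` once per assignment of senders to wildcard
    receives; these assignments are exactly the relevant schedules."""
    return {choice: program(choice)
            for choice in product(*wildcard_choices)}

if __name__ == "__main__":
    # One wildcard receive with two candidate senders: two schedules.
    results = explore(p0_program, [(1, 2)])
    print(results)
    # {(1,): 'deadlock', (2,): 'ok'} -- the bad match is found every time
```

Since every match choice is enforced on some run, the deadlocking schedule cannot be missed, no matter how the underlying runtime happens to time its messages.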
In [8], a scalable approach to detecting deadlocks in MPI programs was proposed. However, this approach relies on the deadlock actually occurring in the current run in order to detect it; it does not have the ability to enforce different relevant schedules to determine whether the deadlock could have occurred under another schedule. MPI-SPIN [15], [16], a model checker based on SPIN, can detect and exhaustively explore all schedules of MPI programs. However, MPI-SPIN requires users to manually build a model of the program being verified, which is an impractical task, especially for non-computer scientists. As mentioned earlier, ISP has the ability to determine and enforce all relevant schedules of MPI programs. However,