Scalable Verification of MPI Programs
Anh Vo and Ganesh Gopalakrishnan
School of Computing, University of Utah, Salt Lake City, UT
{avo,ganesh}@cs.utah.edu
Abstract
Large message passing programs are now being deployed on clusters with hundreds, if not thousands, of processors. Programming bugs in such programs are very hard to debug and greatly hurt productivity. Although many tools aim to help developers debug MPI programs, most of them fail to catch bugs caused by nondeterminism in MPI code. In this work, we propose a distributed, scalable framework that can explore all relevant schedules of MPI programs to check for deadlocks, resource leaks, local assertion errors, and other common MPI bugs.
1. Author Info
Author: Anh Vo
Advisor: Ganesh Gopalakrishnan
Number of years in PhD program: 3
2. Introduction
The Message Passing Interface (MPI) [7] library remains
one of the most widely used APIs for implementing dis-
tributed message passing programs. Its projected usage in
critical, future applications such as Petascale computing [6]
makes it imperative that MPI programs be free of program-
ming logic bugs. This is a very challenging task considering
the size and complexity of optimized MPI programs.
In particular, performance optimizations often introduce many forms of nondeterminism into the code. For example, the MPI_Recv(MPI_ANY_SOURCE, MPI_ANY_TAG) call, which can potentially match a message from any sender in the same communication group (we will later refer to this as a wildcard receive), is often used to re-initiate more work on the first sender that finishes its previous item of work. A more general version of this call is MPI_Waitsome, which waits for a subset of the previously issued communication requests to finish. These nondeterministic constructs can potentially result in MPI program bugs that manifest intermittently – the bane of debugging. Traditional MPI debugging tools such as Marmot [10] insert delays during repeated testing under the same input to perturb the MPI runtime scheduling. Experience indicates that this technique is often unreliable [1]. In order to detect all scheduling-related bugs, the framework under which MPI programs are debugged needs the ability to determine and enforce all relevant schedules (the concept of relevant schedules will be explained in Section 4.1.2). ISP (In-Situ Partial Order) [17], [18], [20], [21], the current state-of-the-art dynamic verifier for MPI programs, is currently the only known tool with this ability.
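To make the hazard concrete, the following toy Python model (our own illustration, not the ISP implementation; the scenario, process ranks, and helper names are invented here) simulates a receiver that posts a wildcard receive followed by a receive from a specific source, and enumerates both possible matches of the wildcard:

```python
from itertools import permutations

# Toy model of the wildcard-receive hazard: process 0 posts
# Recv(ANY_SOURCE) followed by Recv(source=2), while processes 1 and 2
# each send one message to process 0.
def run(arrival_order):
    """Simulate P0's two receives given the order in which the sends
    become matchable. Returns 'ok' or 'deadlock'."""
    pending = list(arrival_order)
    pending.pop(0)            # receive 1: wildcard matches the first send
    if 2 in pending:          # receive 2: must match a send from source 2
        pending.remove(2)
        return "ok"
    return "deadlock"         # P2's send was already consumed by the wildcard

# Explore every relevant schedule, i.e., every wildcard match choice.
for order in permutations((1, 2)):
    print(order, "->", run(order))
# (1, 2) -> ok
# (2, 1) -> deadlock
```

Only the schedule in which the wildcard happens to match P2's send deadlocks; delay injection offers no guarantee of ever steering the runtime into that match, whereas a verifier that enforces both relevant schedules exposes the bug deterministically.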
However, the current framework of ISP does not scale well on large clusters, the environment in which many large MPI applications are currently deployed. In addition, some bugs manifest only when the application scales beyond a certain threshold, and others only in a distributed environment. The next-generation framework must therefore (i) detect and enforce all relevant schedules of MPI programs, (ii) work in distributed settings, and (iii) scale well. To this end, we have designed DMA (Distributed Message Passing Analyzer), a scalable distributed framework that satisfies all of these requirements.
In the rest of this paper, we provide an overview of the framework (Figure 1 shows the proposed design of DMA), as well as the status of the work.
3. Related Work
In recent years, considerable effort has been spent on building efficient verification tools for MPI programs, such as [10], [12], [19]. However, none of those tools offers the three basic abilities discussed earlier (detecting and enforcing relevant interleavings, working in distributed settings, and scaling well). For example, tools such as Marmot have been shown in our experiments to miss very simple deadlock scenarios [1]. In [8], a scalable approach to detecting deadlocks in MPI programs was proposed. Yet this approach relies on the deadlock actually occurring in the current run in order to detect it; it cannot enforce different relevant schedules to determine whether the deadlock could have happened in another schedule.
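The distinction can be sketched with a toy Python model (our own illustration with invented helper names, not the algorithm of [8] or of ISP): consider a receiver that posts a wildcard receive followed by a receive from source 2, with senders 1 and 2. A passive checker only judges the one match the runtime happened to pick, while an enforcing verifier drives the wildcard through every relevant match:

```python
import random

# Matching the wildcard to sender 2 starves the later deterministic
# receive from source 2, producing a deadlock.
def verdict(wildcard_match):
    remaining = {1, 2} - {wildcard_match}
    return "ok" if 2 in remaining else "deadlock"

# Passive detection: inspect only the match the runtime happened to make.
rng = random.Random()
observed = verdict(rng.choice((1, 2)))   # may report "ok" and miss the bug

# Enforced exploration: replay the program once per relevant match.
verdicts = {src: verdict(src) for src in (1, 2)}
print(observed, verdicts)
# verdicts == {1: 'ok', 2: 'deadlock'}: the deadlock is found regardless
# of which match the runtime chose in any particular run.
```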
MPI-SPIN [15], [16], a model checker based on SPIN, can detect bugs by exhaustively exploring all schedules of MPI programs. However, MPI-SPIN requires the user to manually build a model of the program being verified, an impractical task, especially for non-computer scientists.
As mentioned earlier, ISP has the ability to determine and
enforce all relevant schedules of MPI programs. However,
978-1-4244-6534-7/10/$26.00 ©2010 IEEE