Automatically Identifying Known Software Problems
Natwar Modani¹, Rajeev Gupta¹, Guy Lohman², Tanveer Syeda-Mahmood², Laurent Mignet¹
¹ IBM India Research Lab, Block-1, IIT Delhi Campus, New Delhi, India.
² IBM Almaden Research Center, San Jose, CA, USA
(namodani, grajeev, lamignet)@in.ibm.com; (stf, lohman)@almaden.ibm.com
Abstract
Re-occurrence of the same problem is very common
in many large software products. By matching the
symptoms of a new problem to those in a database of
known problems, automated diagnosis and even self-
healing for re-occurrences can be (partially) realized.
This paper exploits function call stacks as highly
structured symptoms of a certain class of problems,
including crashes, hangs, and traps. We propose and
evaluate algorithms for efficiently and accurately
matching call stacks by a weighted metric of the
similarity of their function names, after first removing
redundant recursion and uninformative (poor
discriminator) functions from those stacks. We also
describe a new indexing scheme to speed queries to the
repository of known problems, without compromising
the quality of matches returned. Experiments
conducted using call stacks from actual product
problem reports demonstrate the improved accuracy
(both precision and recall) resulting from our new
stack-matching algorithms and removal of
uninformative or redundant function names, as well as
the performance and scalability improvements realized
by indexing call stacks. We also discuss how call-stack
matching can be used in both self-managing (or
autonomic) systems and human “help desk”
applications.
1. Introduction
One of the biggest challenges to self-healing
systems is correctly diagnosing a problem based upon
its externalized symptoms. However, typically half –
and sometimes as much as 90 percent – of all software
problems reported by users today are re-occurrences,
or rediscoveries, of known problems, i.e. those whose
cause has already been ascertained or is under
investigation. Such rediscoveries present a significant
opportunity for automatically repairing systems by
searching a database of symptoms of known problems
to find the best match with the symptoms of any new
problem. But how to uniquely characterize any
problem by its symptoms, and how to match those
characterizations accurately, remains challenging, in
general. Fortunately, there is a large class of software
problems for which fairly structured symptoms can be
used to characterize the problem, namely those that
produce a function call stack among their symptoms.
Systems typically generate function call stacks (we’ll
use the shorter term “call stacks” henceforth) when
software “crashes”, is terminated after a “hang”, or
“traps” and reports an error itself. Call
stacks reconstruct the sequence of function calls
leading up to the failure from the operating system’s
stack of addresses, onto which an address is pushed
each time a function is called and from which it is
popped when the function returns. Call stacks
typically contain at least the function name and offset
in the routine at which each subroutine was invoked or
the problem occurred or was detected. This paper uses
the call stack as a symptom to characterize such
problems.
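
As a minimal sketch of the symptom described above, a call stack can
be modeled as an ordered list of frames, each holding a function name
and an offset. The textual form `name+0xOFFSET` (one frame per line,
top of stack first), the `Frame` type, and the `parse_stack` helper
are assumptions made for illustration only; actual dump formats vary
by platform and product.

```python
from typing import List, NamedTuple


class Frame(NamedTuple):
    function: str  # name of the function in this frame
    offset: int    # offset within the function, in bytes


def parse_stack(text: str) -> List[Frame]:
    """Parse lines of the assumed form 'name+0xOFFSET', top of stack first."""
    frames = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        name, _, off = line.partition("+")
        # A frame without an offset is recorded with offset 0.
        frames.append(Frame(name, int(off, 16) if off else 0))
    return frames
```

Only the function names are needed for the matching discussed in this
paper; the offsets are retained here because call stacks typically
report them.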
Clearly, if two call stacks are identical, they almost
surely represent the same problem, but what if they
only partially match? The function in which the
problem occurred is of course the most important to
match, but that function may not be at the top of the
stack if the code trapped the problem and invoked
some standard routines to report and/or recover from
the error, which provide little enlightenment on the
nature of the problem. Furthermore, the path by which
the execution got to the offending function may alter
the values of key parameters that contribute to a
problem, but the farther a function is in the stack
from the offending function, the less likely it is to
have such an impact and the more likely it is to be
common to other problem call stacks. And if the
call stack contains recursive calls to the same
functions, the number of recursions is rarely important
for problem determination. Hence in our matching we
want to weight matches nearer the top of the stack
more heavily, after omitting redundant recursive
invocations and the “uninformative functions” such as
common error routines and entry-level routines.
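
The intuition above can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not the matching algorithm
evaluated in this paper: the geometric `decay` weighting, the
position-by-position comparison, the order of the two clean-up steps,
and all function names in the usage example are hypothetical. Stacks
are lists of function names, top of stack first.

```python
from typing import List, Set


def collapse_recursion(stack: List[str]) -> List[str]:
    """Keep one frame per run of consecutive identical function names."""
    out: List[str] = []
    for fn in stack:
        if not out or out[-1] != fn:
            out.append(fn)
    return out


def preprocess(stack: List[str], uninformative: Set[str]) -> List[str]:
    """Drop uninformative functions, then collapse direct recursion."""
    kept = [fn for fn in stack if fn not in uninformative]
    return collapse_recursion(kept)


def weighted_similarity(a: List[str], b: List[str], decay: float = 0.5) -> float:
    """Score in [0, 1]; a match at depth i contributes weight decay**i."""
    n = max(len(a), len(b))
    if n == 0:
        return 0.0
    weights = [decay ** i for i in range(n)]
    # zip stops at the shorter stack, so unmatched depth counts against the score.
    matched = sum(w for w, x, y in zip(weights, a, b) if x == y)
    return matched / sum(weights)


# Hypothetical example: an error routine and an entry routine are ignored.
noise = {"sqlReportError", "main"}
known = preprocess(["sqlReportError", "sqlSort", "sqlSort", "sqlExec", "main"], noise)
new = preprocess(["sqlReportError", "sqlSort", "sqlExec", "main"], noise)
score = weighted_similarity(known, new)
```

With these choices, two stacks that differ only in recursion depth and
in uninformative frames score as identical, and a mismatch at the top
of the stack costs more than the same mismatch deeper down.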
1-4244-0832-6/07/$20.00 ©2007 IEEE.