Toward Sharing Reasoning to Improve Fault Localization in Spreadsheets

Joseph Lawrance, Margaret Burnett, Robin Abraham and Martin Erwig
School of Electrical Engineering and Computer Science
Oregon State University
Corvallis, Oregon 97331
{lawrance,burnett,abraharo,erwig}@eecs.oregonstate.edu

Abstract

Although researchers have developed several ways to reason about the location of faults in spreadsheets, no single form of reasoning is without limitations. Multiple types of errors can appear in spreadsheets, and various fault localization techniques differ in the kinds of errors that they are effective in locating. Because end users who debug spreadsheets consistently follow the advice of fault localization systems [9], it is important to ensure that fault localization feedback corresponds as closely as possible to where the faults actually appear.

In this paper, we describe an emerging system that attempts to improve fault localization for end-user programmers by sharing the results of the reasoning systems found in WYSIWYT [13, 14] and UCheck [1, 6]. By understanding the strengths and weaknesses of the reasoning found in each system, we expect to identify where different forms of reasoning complement one another, when different forms of reasoning contradict one another, and which heuristics can be used to select the best advice from each system. By using multiple forms of reasoning in conjunction with heuristics to choose among recommendations from each system, we expect to produce unified fault localization feedback whose combination is better than the sum of the parts.

1 Introduction

Spreadsheet systems like Excel are among the most widely used programming systems. Research estimates that the number of end-user programmers, which includes spreadsheet users, outnumbers professional programmers by an order of magnitude [15].
Both end-user programmers and professional programmers often make mistakes, but end-user programmers rarely possess the organized test suites and knowledge of software engineering methodologies that professional programmers have to mitigate problems. Unfortunately, up to 90% or more of spreadsheets contain faults [7, 10]. Because spreadsheets are often used for important tasks and decisions, faults in them have been tied to costly errors.¹ The potential risks of spreadsheet faults extend beyond monetary costs, particularly in light of the Sarbanes-Oxley Act of 2002, a law which requires corporations to examine the validity of their spreadsheets [8].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Although spreadsheets are essentially a grid of cells, various information bases can be extracted out of spreadsheets, and each information base can highlight different categories of faults. For example, cells often contain explicit relationships to other cells, in the form of cell references, from which data flow graphs emerge; these data flow graphs can be used to identify reference faults² [5]. Furthermore, the juxtaposition of row and column headers against cells containing data within spreadsheets typically implies spatial relationships among cells, from which unit inference graphs emerge. Unit inference can be used to identify certain types of reference, range, and omission faults [2]. Other information bases supplied by end users can assist fault localization.
For example, the value of cells is often expected to fall within certain intervals; by asserting intervals on cells, cells whose values fall outside their intervals can be located [4, 3, 5]. Adding assertions helped significantly with non-reference faults, suggesting that the addition of assertions into the environment fills a need not met effectively by the data flow testing methodology alone [5]. Furthermore, in several domains, particularly finance, it is often the case that two cells within a spreadsheet must add up to the same value; asserting relationships such as equality among groups of cells can be used to audit spreadsheets.

Our work in progress to improve fault localization is based on the assumption that reasoning about faults in only one way is insufficient to locate several different categories of faults effectively. Our emerging prototype relies on the results of the independent reasoning systems found in UCheck and in WYSIWYT. The two systems base their judgments on different information bases derived from spreadsheets: UCheck analyzes the spatial juxtaposition of row and column headers against data cells, whereas WYSIWYT uses data flow relationships in conjunction with users' judgments to locate faults. By leveraging the reasoning produced from two different information bases, we expect to produce better feedback.

We believe that sharing the results of reasoning systems in a way sufficient to locate several categories of faults requires a shared reasoning database and heuristics to resolve competing and sometimes conflicting suggestions from different systems.

¹ http://www.eusprig.org/stories.htm
² One classification scheme we have found to be useful in our previous research involves two fault types: reference faults, which are faults of incorrect or missing references, and non-reference faults, which are all other faults.
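
As an illustration of the interval and equality assertions mentioned in the introduction, the following is a minimal sketch, not the mechanism used in WYSIWYT or in the cited assertion work. It assumes a spreadsheet reduced to a mapping from cell names to numeric values; the names `Interval`, `interval_violations`, and `equality_violations`, the cell names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A user-asserted closed range [low, high] for a cell (hypothetical type)."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def interval_violations(values, intervals):
    """Return, in sorted order, the cells whose value lies outside its asserted interval."""
    return sorted(
        cell
        for cell, interval in intervals.items()
        if cell in values and not interval.contains(values[cell])
    )


def equality_violations(values, groups, tolerance=1e-9):
    """Return the groups of cells that were asserted equal but whose values differ."""
    violated = []
    for group in groups:
        present = [values[cell] for cell in group if cell in values]
        if present and max(present) - min(present) > tolerance:
            violated.append(tuple(group))
    return violated


# Illustrative data only: B4 and D4 are assumed to be two totals that should agree.
values = {"B2": 40.0, "B3": 120.0, "B4": 160.0, "D4": 150.0}
intervals = {"B2": Interval(0, 100), "B3": Interval(0, 100)}
groups = [("B4", "D4")]

print(interval_violations(values, intervals))  # ['B3']
print(equality_violations(values, groups))     # [('B4', 'D4')]
```

A check of this kind only flags cells whose values contradict an assertion; deciding which upstream formula is actually faulty would still require other reasoning, such as the data flow relationships discussed above.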