Probabilistic Graph Models for Debugging Software

Laura Dietz
RG2: Machine Learning
Max Planck Institute for Computer Science
Saarbrücken, Germany
dietz@mpi-inf.mpg.de

Valentin Dallmeier
Dept. of Computer Science
Saarland University
Saarbrücken, Germany
dallmeier@st.cs.uni-sb.de

1 Introduction

Of all software development activities, debugging (locating the defective source code statements that cause a failure) can be by far the most time-consuming. We employ probabilistic modeling to support programmers in finding defective code.

Most defects are identifiable in control flow graphs of software traces. A trace is represented by the sequence of code positions (line numbers in source files) that are executed when the software runs. The control flow graph represents the finite state machine of the program, in which states denote code positions and arcs indicate valid follow-up code positions. In this work, we extend this definition to an n-gram control flow graph, in which a state represents a fragment of consecutive code positions, also referred to as an n-gram of code positions. We devise a probabilistic model for such graphs in order to infer code positions at which anomalous program behavior can be observed. The model is evaluated on real-world data obtained from the open source AspectJ project and compared to the well-known multinomial and multi-variate Bernoulli models [1].

Today's best practice in software development suggests developing two kinds of source code: production code, which implements the functionality and will be shipped to customers, and test code, which consists of several self-contained programs (called test cases) that evaluate the correctness of routines in the production code. When a developer modifies production code (e.g., to fix a defect or add a feature), all test cases are executed. If any test case fails, the production code contains a defect, which has to be resolved by the programmer before shipping the code.
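The construction of an n-gram control flow graph from a trace can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a trace is given as a list of (filename, line number) pairs, slides a window of length n over it to form n-gram states, and counts transitions between overlapping n-grams as arcs. The trace contents and the helper name are illustrative only.

```python
from collections import defaultdict

def ngram_control_flow_graph(trace, n=2):
    """Build an n-gram control flow graph from one execution trace.

    Each state is an n-gram (tuple) of consecutive code positions;
    each arc counts observed transitions between overlapping n-grams.
    """
    # Slide a window of length n over the trace to obtain the states.
    states = [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]
    # Consecutive n-grams overlap in n-1 positions; count each transition.
    arcs = defaultdict(int)
    for prev, nxt in zip(states, states[1:]):
        arcs[(prev, nxt)] += 1
    return set(states), dict(arcs)

# Hypothetical trace: (filename, line number) pairs recorded at runtime,
# including a loop that revisits lines 11 and 12.
trace = [("Foo.java", 10), ("Foo.java", 11), ("Foo.java", 12),
         ("Foo.java", 11), ("Foo.java", 12)]
states, arcs = ngram_control_flow_graph(trace, n=2)
```

With n = 1 this degenerates to the ordinary control flow graph over single code positions; larger n makes states capture short execution contexts.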
At this point, a programmer should be supported by a prediction of which parts of the production code are likely to contain defects. User interface widgets might then guide the user through a list of code positions that are likely to point to the defect.

Current approaches to the defect localization problem fall into three categories. The work of Elfeky et al. [2] relies on the fact that different programmers are likely to make the same errors when using certain programming concepts; e.g., uses of complex concepts such as semaphores are more likely to be erroneous than simple operations such as incrementing an integer. The approach consists of training a latent Dirichlet allocation model on defective source code, treating it as a text document. The trained model can then be employed to detect recurring defect patterns. On the downside, this approach relies on code that is manually labeled with defect topics, often created by intentionally introducing such mistakes into code.

The second category (which includes our approach) draws inference from the control flow of the program, i.e., the sequence of executed statements. The underlying assumption is that code not covered by passing test cases is likely to contain defects. For this reason, these techniques are also called code coverage methods. Tarantula [3] is a heuristic that yields a rank score for each statement, depending on the number of passing and failing test cases that executed the statement. So far, Tarantula is the best known algorithm for predicting defect locations. Our approach belongs to this category as well.

* Also affiliated with University of Potsdam, Germany.
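Tarantula's rank score for a statement can be sketched as below, following the commonly cited formulation: a statement's suspiciousness is the fraction of failing tests that execute it, normalized by the sum of the passing and failing execution ratios. The function name and the example counts are illustrative, not from the paper.

```python
def tarantula_suspiciousness(passed, failed, total_passed, total_failed):
    """Tarantula rank score for one statement.

    passed/failed: number of passing/failing test cases that executed
    the statement; total_passed/total_failed: suite-wide totals.
    Scores range from 0 (only passing tests touch the statement)
    to 1 (only failing tests touch it).
    """
    pass_ratio = passed / total_passed if total_passed else 0.0
    fail_ratio = failed / total_failed if total_failed else 0.0
    if pass_ratio + fail_ratio == 0.0:
        return 0.0  # statement not executed by any test case
    return fail_ratio / (pass_ratio + fail_ratio)

# A statement executed by all 3 failing tests but only 2 of 10 passing
# tests receives a high score: 1.0 / (0.2 + 1.0) ~= 0.83.
score = tarantula_suspiciousness(passed=2, failed=3,
                                 total_passed=10, total_failed=3)
```

Ranking all statements by this score yields the ordered list of candidate defect locations that a programmer would inspect first.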