Measuring the Semantic Similarity of Comments in Bug Reports

Bogdan Dit, Denys Poshyvanyk, Andrian Marcus
Department of Computer Science
Wayne State University
Detroit, Michigan 48202
313 577 5408
<bdit, denys, amarcus>@wayne.edu

Abstract

Bug-tracking systems, such as Bugzilla, contain a large amount of information about software defects, most of it stored in textual rather than structured form. This information is used not only for locating and fixing bugs, but also for detecting duplicate bug reports, triaging incoming bugs, automatically assigning bugs to developers, etc. Given the importance of the textual information in bug reports, it is desirable that this text be highly coherent, so that readers can easily understand it. This paper describes an approach to measuring the textual coherence of user comments in bug reports. The coherence of bug reports from Eclipse was measured and the results are discussed in the paper.

1. Introduction

A large part of software development and maintenance effort is spent on locating and fixing bugs. It is common in large projects to use defect reporting and tracking systems, such as Bugzilla 1. Such systems collect a great deal of information about identified defects, most of it in natural-language text, such as bug descriptions, user comments, etc. The information provided in these bug reports influences the time it takes to fix the bugs [2, 16], and it can be used to support tasks such as impact analysis [3, 4], detection of duplicate bug reports [13, 14], or assigning bug reports to developers [1, 5, 7]. It has been shown that bug reports differ greatly in the quality of their information [10, 11, 15]. The proposed quality models ignore the user comments posted in the bug reports. We argue that good bug reports should contain not only good textual descriptions of the problem and properly selected attributes, but also coherent and relevant comments.
1 http://www.bugzilla.org

In this paper we propose a novel approach to measuring the textual coherence of user comments in bug reports. We consider that the textual coherence of user comments affects the comprehensibility of bug reports; hence it is important to measure it. Our measurement technique relies on Information Retrieval (IR) techniques, which allow for the automatic coherence measurement of user comments in large bug repositories. We measured the coherence of bug reports from Eclipse 2 and our preliminary results suggest that the proposed measure correlates with assessments provided by software developers.

2. Background and Motivation

Bug-tracking repositories provide a means of communication among geographically distributed developers and teams. Developers can describe and issue new bug reports, comment on existing bug reports, suggest fixes to bugs, subscribe to e-mail discussions for specific bug reports, etc.

An individual record in a bug-tracking database is referred to as an issue or bug report. A typical bug report consists of several components: a title (or short summary); attributes or pre-defined fields, such as the bug report id number, creation date, reporter, product, component, operating system, version, priority, severity, e-mail addresses of developers on the mailing list for the bug, etc.; and a long description and comments, which are posted by developers.

Bugzilla's published usage rules specify the following about writing comments: "If you are changing the fields on a bug, only comment if either you have something pertinent to say, or Bugzilla requires it. Otherwise, you may spam people unnecessarily with bug mail." 3 Each project usually defines its own guidelines on how to post comments in the bug reports. For example,

2 http://www.eclipse.org
3 http://www.bugzilla.org/docs/2.18/html/hintsandtips.html
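To make the idea of an IR-based coherence measure concrete, the sketch below scores a comment thread by the average cosine similarity between consecutive comments. This is a minimal illustration only: it uses a plain term-frequency vector model, whereas the paper's actual IR technique may differ (e.g. in weighting or dimensionality reduction), and the sample comments and function names are hypothetical.

```python
# Sketch: coherence of a bug-report comment thread as the average
# cosine similarity of consecutive comments (term-frequency vectors).
# Illustrative only; the exact IR model used in the paper may differ.
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector for one comment."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def comment_coherence(comments):
    """Average similarity of each comment to the one before it."""
    if len(comments) < 2:
        return 1.0
    vecs = [vectorize(c) for c in comments]
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    return sum(sims) / len(sims)

# Hypothetical comment thread: two on-topic comments and one off-topic one.
comments = [
    "NullPointerException when opening the editor on startup",
    "The editor crashes with a NullPointerException on startup for me too",
    "My cat walked across the keyboard",
]
print(f"coherence: {comment_coherence(comments):.2f}")
```

Under this sketch, a thread of on-topic comments scores close to 1, while off-topic digressions pull the score toward 0, matching the intuition that coherent comment threads are easier to read.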