Information Retrieval on Bug Locations by Learning Co-located Bug Report Clusters

Ing-Xiang Chen
Yuan Ze University
Chungli, Taiwan, 320
sean@syslab.cse.yzu.edu.tw

Hojun Jaygarl
Iowa State University
Ames, IA 50011
jaygarl@cs.iastate.edu

Cheng-Zen Yang
Yuan Ze University
Chungli, Taiwan, 320
czyang@syslab.cse.yzu.edu.tw

Ping-Jung Wu
Yuan Ze University
Chungli, Taiwan, 320
pjwu@syslab.cse.yzu.edu.tw

ABSTRACT
Bug locating usually involves intensive search activities and incurs unpredictable cost of labor and time. An issue of information retrieval on bug locations is particularly addressed to facilitate identifying bugs from software code. In this paper, a novel bug retrieval approach with co-location shrinkage (CS) is proposed. The proposed approach has been implemented in open-source software projects collected from real-world repositories, and consistently improves the retrieval accuracy of a state-of-the-art Support Vector Machine (SVM) model.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Clustering, Search process; I.2.6 [Artificial Intelligence]: Concept learning

General Terms: Algorithms, Experimentation

Keywords: bug retrieval, bug report managing systems, support vector machine, co-location shrinkage

1. INTRODUCTION
When faults and bugs are found in software, they are generally reported as bug reports, which are composed of semi-structured text, in bug repositories for further fault tracking and debugging [3]. In the conventional process of debugging, however, intensive search, which usually involves browsing back and forth through bug reports and software code, is required to locate bugs. Unfortunately, information access techniques that require typing schemata at data-entry time are not appropriate for retrieving semi-structured text such as bug information. Accordingly, information retrieval (IR) on semi-structured text has been addressed as one of the research challenges [1].
To effectively support semi-structured bug information retrieval, historical bug reports co-cited by the same locations can be further clustered and mined. In this paper, a co-location shrinkage (CS) technique combined with a powerful support vector machine (CS-SVM) is presented to retrieve potential bug locations, following an advanced learning approach [5]. The proposed approach has been implemented in three open-source software projects extracted from real-world bug report managing systems (BRMS). With the proposed method, the accuracy of retrieving correct bug locations can be consistently raised from 4.2% to 31.8% by providing a recommendation list of 10 bug locations.

Copyright is held by the author/owner(s). SIGIR’08, July 20–24, 2008, Singapore. ACM 978-1-60558-164-4/08/07.

2. PROBLEM STATEMENT
As in common debugging procedures, the bug locations corresponding to a bug report are usually retrieved from version archives and BRMS. To retrieve the potential bug locations L, the prediction of L can be denoted as P(B | L), where an incoming bug report B is examined and predicted according to its meta-information. The relationship between bug reports and locations can be a 1-to-many mapping, since a bug may be fixed in several locations. Hence, bug prediction is defined as providing users with a recommendation list of possible bug locations. This study retrieves the locations of faults at the level of files. The bug retrieval approach can similarly be applied to other levels of source elements, such as packages, classes, and methods, with minor changes.

3. THE BUG RETRIEVAL APPROACH
Figure 1 illustrates a snippet to explain the idea of clustering the co-located bug reports. A co-location shrinkage (CS) technique, following [5], is further presented to strengthen the semantic connection of bug reports in the co-located clusters. Figure 2 depicts the CS algorithm for shrinking the co-located bug reports.
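The clustering idea can be made concrete with a minimal sketch. It groups bug reports by the files in which they were fixed, so that one report may join several clusters, and then pulls each report vector toward the centroid of its cluster. The shrinkage step shown here is only an assumption: this text does not spell out the CS formula, so a plain linear interpolation is used as a stand-in, and the function names and the mixing weight `lam` are hypothetical.

```python
from collections import defaultdict


def co_location_clusters(fixes):
    """Group bug report ids by the file each report was fixed in.

    `fixes` maps a bug report id to the set of files changed by its fix
    (a 1-to-many mapping, so a report may appear in several clusters).
    """
    clusters = defaultdict(set)
    for report_id, files in fixes.items():
        for path in files:
            clusters[path].add(report_id)
    return dict(clusters)


def centroid(vectors):
    """Average a non-empty list of sparse term-weight vectors (dicts)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def shrink(vec, center, lam=0.5):
    """ASSUMPTION: interpolate a report vector toward its cluster centroid.

    This is a stand-in for the CS step, not the authors' formula;
    `lam` is a hypothetical mixing weight in [0, 1].
    """
    terms = set(vec) | set(center)
    return {
        t: (1.0 - lam) * vec.get(t, 0.0) + lam * center.get(t, 0.0)
        for t in terms
    }


# Hypothetical toy data: report ids, file names, and term weights are made up.
fixes = {"bug1": {"a.c", "b.c"}, "bug2": {"a.c"}, "bug3": {"c.c"}}
vectors = {"bug1": {"crash": 1.0}, "bug2": {"crash": 0.5, "merge": 0.5}}

clusters = co_location_clusters(fixes)
center = centroid([vectors[r] for r in clusters["a.c"]])
shrunk = shrink(vectors["bug1"], center, lam=0.5)
```

In this toy example, the report `bug1` acquires a nonzero weight for the term `merge` only because it shares the location `a.c` with `bug2`, which is the kind of strengthened connection among co-located reports that the shrinkage is meant to provide.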
Before the potential bug locations are retrieved, pairs of bug reports and their fixed locations are used to train and test the CS-SVM model.

Figure 3 illustrates the overview of the bug retrieval process. Bug reports were preprocessed into a bag-of-words representation by standard IR techniques, namely tokenization, stemming, and stopword removal, and then transformed into vector representations with TF · IDF weighting. Finally, the debugging knowledge extracted from historical bug reports was learned by the proposed CS-SVM model, and newly arriving bug reports are tested to predict the bug locations.

4. EXPERIMENTAL RESULTS
The proposed CS-SVM scheme has been evaluated with three open-source projects, Subversion (SVN), AspectJ, and ArgoUML. To evaluate the fixed locations, bug reports re-