Better Cross Company Defect Prediction

Fayola Peters, Tim Menzies
Lane Department of CS & EE, West Virginia University, USA
fayolapeters@gmail.com, tim@menzies.us

Andrian Marcus
Computer Science, Wayne State University, USA
amarcus@wayne.edu

Abstract—How can we find data for quality prediction? Early in the life cycle, projects may lack the data needed to build such predictors. Prior work assumed that relevant training data was found nearest to the local project. But is this the best approach? This paper introduces the Peters filter, which is based on the following conjecture: when local data is scarce, more information exists in other projects. Accordingly, this filter selects training data via the structure of other projects. To assess the performance of the Peters filter, we compare it with two other approaches for quality prediction: (1) within-company learning and (2) cross-company learning with the Burak filter (the state-of-the-art relevancy filter). This paper finds that: (1) within-company predictors are weak for small data-sets; (2) the Peters filter+cross-company builds better predictors than both within-company and the Burak filter+cross-company; and (3) the Peters filter builds 64% more useful predictors than both the within-company and the Burak filter+cross-company approaches. Hence, we recommend the Peters filter for cross-company learning.

Index Terms—Cross company; defect prediction; data mining

I. INTRODUCTION

Defect prediction is a method for predicting the number of defects in software. It is valuable for organizing a project's test resources [1]. For example, given limited resources for software inspection, defect predictors can focus test engineers on the modules most likely to be defective [2]. Zimmermann et al. [3] warn that defect prediction works well within projects only as long as there is sufficient data to train models. That is, to build defect predictors, we need access to historical data. If the data is missing, what can we do?
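To make the within-project case concrete, a defect predictor can be as simple as a classifier trained on historical module metrics labelled defective or clean. The sketch below uses a 1-nearest-neighbour learner over hypothetical metric vectors; the metrics, data, and learner are illustrative assumptions, not the experimental setup of this paper.

```python
import math

def predict_defective(history, module):
    """history: list of (metrics, is_defective) pairs from past releases;
    module: metric vector of a new module. Returns the defect label of
    the nearest historical module (1-NN by Euclidean distance)."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(history, key=lambda h: dist(h[0], module))
    return label

# Hypothetical history: (lines of code, cyclomatic complexity) per module.
history = [((10, 2), False), ((15, 3), False), ((200, 40), True)]
```

A test engineer would then inspect first the modules for which `predict_defective` returns `True`. The point of the paper is what to do when such a labelled `history` does not exist locally.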
Cross Company Defect Prediction (CCDP) is the art of using data from other companies to build defect predictors. CCDP lets software companies with small unlabeled data-sets use data from other companies to build their quality predictors. Multiple recent studies have certified the utility of this approach for defect prediction [2], [4]–[6] (as well as effort estimation [7]). For example, given the right relevancy filtering (described below and illustrated in Figure 1), Tosun et al. [8] used data from NASA systems to predict defects in software for Turkish domestic appliances (and vice versa)¹.

A major issue in CCDP is how to find the right training data in a software repository. There is much data, freely available, on Software Engineering (SE) projects (e.g., this study uses 56 defect data sets from the PROMISE repository [10]). Rodriguez et al. document 18 repositories, including PROMISE, that offer software project data [11]. However, much of this data is irrelevant to specific projects. Turhan et al. showed that if we use all the data from a Training Data Set (TDS), an aggregate of multiple data-sets, then the resulting defect predictor will have excessive false alarms [2]. A more recent study by Peters et al. [12] demonstrated that if we used all the data from a TDS, then false alarms and recall would be low.

When reasoning about new problems, it is wise to reflect carefully on the old data. Before we can find defects in local data, we must filter the TDS to select the most useful Filtered TDS. One such filtering method, shown in Figure 1, is the Burak filter [2], which returns the nearest TDS instances for each test instance.

Fig. 1: Aggregate the data from the repository into a TDS. Using a filter with the Test instances, find FilteredTDS ⊂ TDS.

¹ Elsewhere we have explained this surprising result by a consideration of clusters built from eigenvectors of the data [9].
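A minimal sketch of Burak-style relevancy filtering as described above: for each test instance, keep its k nearest neighbours from the aggregated TDS, and take the union as the Filtered TDS. The function names, data, and default k are illustrative assumptions (k = 10 is the value commonly used in this line of work).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two metric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def burak_filter(tds, test, k=10):
    """Return the Filtered TDS: the union (by index, to avoid
    duplicates) of each test instance's k nearest TDS instances."""
    keep = set()
    for t in test:
        ranked = sorted(range(len(tds)), key=lambda i: euclidean(tds[i], t))
        keep.update(ranked[:k])
    return [tds[i] for i in sorted(keep)]
```

Note that the selection is entirely test-guided, which is why the filter must be re-run whenever new test data arrives.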
The core idea of this filter is to use the Test instances to guide the selection of the Filtered TDS. The Burak filter must be repeated each time new test data arrives. But is that the right way to do the filtering? Is there any advantage to learning and caching some structures in the training data before reflecting over the test data? The following speculation argues that such an advantage might exist, in the form of the Peters filter:

• When one company wants to use data from many other companies, the expected case is that the Test data (from this company) is much smaller than the TDS (the training data set) from all the other companies;
• When Test is smaller than TDS, then there should be more information about defects in the TDS than in Test;
• Hence, when selecting relevant data, it might be better to guide that search using the structure of the TDS training data rather than the Test data.

978-1-4673-2936-1/13 © 2013 IEEE. MSR 2013, San Francisco, CA, USA. Accepted for publication by IEEE.
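One concrete (and here purely hypothetical) reading of "guide the search using the structure of the TDS" is to let each training instance nominate its nearest test instance, then keep, per test instance, the closest training instance that nominated it. The sketch below illustrates only the training-guided direction of the conjecture; it is an assumption for illustration, not necessarily the Peters filter's exact algorithm.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def training_guided_filter(tds, test):
    """Each TDS instance 'nominates' its nearest Test instance; for
    each Test instance, keep only the closest nominating TDS instance.
    Selection is driven by the (larger) training set, so the learned
    structure could be cached between test sets."""
    nominees = {}  # test index -> list of (distance, tds index)
    for i, row in enumerate(tds):
        j = min(range(len(test)), key=lambda j: euclidean(row, test[j]))
        nominees.setdefault(j, []).append((euclidean(row, test[j]), i))
    keep = sorted(min(group)[1] for group in nominees.values())
    return [tds[i] for i in keep]
```

Unlike the test-guided Burak filter, here the many training instances compete for the few test instances, exploiting the expected size asymmetry noted in the bullets above.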