Better Cross Company Defect Prediction

Fayola Peters, Tim Menzies
Lane Department of CS & EE, West Virginia University, USA
fayolapeters@gmail.com, tim@menzies.us

Andrian Marcus
Computer Science, Wayne State University, USA
amarcus@wayne.edu

Abstract—How can we find data for quality prediction? Early in the life cycle, projects may lack the data needed to build such predictors. Prior work assumed that relevant training data was found nearest to the local project. But is this the best approach? This paper introduces the Peters filter, which is based on the following conjecture: when local data is scarce, more information exists in other projects. Accordingly, this filter selects training data via the structure of other projects. To assess the performance of the Peters filter, we compare it with two other approaches for quality prediction: (1) within-company learning and (2) cross-company learning with the Burak filter (the state-of-the-art relevancy filter). This paper finds that: (1) within-company predictors are weak for small data-sets; (2) the Peters filter+cross-company builds better predictors than both within-company and the Burak filter+cross-company; and (3) the Peters filter builds 64% more useful predictors than both the within-company and the Burak filter+cross-company approaches. Hence, we recommend the Peters filter for cross-company learning.

Index Terms—Cross company; defect prediction; data mining

I. INTRODUCTION

Defect prediction is a method for predicting the number of defects in software. It is valuable for organizing a project's test resources [1]. For example, given limited resources for software inspection, defect predictors can focus test engineers on the modules most likely to be defective [2]. Zimmermann et al. [3] warn that defect prediction works well within projects only as long as there is sufficient data to train models. That is, to build defect predictors, we need access to historical data. If the data is missing, what can we do?
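To make the within-project case concrete, a defect predictor can be as simple as a classifier trained on historical module metrics labelled defective or clean. The sketch below uses a 1-nearest-neighbour learner over hypothetical metric vectors; the metrics, data, and learner are illustrative assumptions, not the experimental setup of this paper.

```python
import math

def predict_defective(history, module):
    """history: list of (metrics, is_defective) pairs from past releases;
    module: metric vector of a new module. Returns the defect label of
    the nearest historical module (1-NN by Euclidean distance)."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(history, key=lambda h: dist(h[0], module))
    return label

# Hypothetical history: (lines of code, cyclomatic complexity) per module.
history = [((10, 2), False), ((15, 3), False), ((200, 40), True)]
```

A test engineer would then inspect first the modules for which `predict_defective` returns `True`. The point of the paper is what to do when such a labelled `history` does not exist locally.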
Cross Company Defect Prediction (CCDP) is the art of using data from other companies to build defect predictors. CCDP lets software companies with small unlabeled data-sets use data from other companies to build their quality predictors. Multiple recent studies have certified the utility of this approach for defect prediction [2], [4]–[6] (as well as effort estimation [7]). For example, given the right relevancy filtering (described below and illustrated in Figure 1), Tosun et al. [8] used data from NASA systems to predict defects in software for Turkish domestic appliances (and vice versa)¹.

A major issue in CCDP is how to find the right training data in a software repository. There is much data, freely available, on Software Engineering (SE) projects (e.g., this study uses 56 defect data sets from the PROMISE repository [10]). Rodriguez et al. document 18 repositories, including PROMISE, that offer software project data [11]. However, much of this data is irrelevant to specific projects. Turhan et al. showed that if we use all the data from a Training Data Set (TDS), an aggregate of multiple data-sets, then the resulting defect predictor will have excessive false alarms [2]. A more recent study by Peters et al. [12] demonstrated that if we used all the data from a TDS, then false alarms and recall would be low.

When reasoning about new problems, it is wise to reflect carefully on the old data. Before we can find defects in local data, we must filter the TDS to select the most useful Filtered TDS. One such filtering method, shown in Figure 1, is the Burak filter [2], which returns the nearest TDS instances for each test instance.

Fig. 1: Aggregate the data from the repository into a TDS. Using a filter with the Test instances, find FilteredTDS ⊂ TDS.

¹ Elsewhere we have explained this surprising result by a consideration of clusters built from eigenvectors of the data [9].
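A minimal sketch of Burak-style relevancy filtering as described above: for each test instance, keep its k nearest neighbours from the aggregated TDS, and take the union as the Filtered TDS. The function names, data, and default k are illustrative assumptions (k = 10 is the value commonly used in this line of work).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two metric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def burak_filter(tds, test, k=10):
    """Return the Filtered TDS: the union (by index, to avoid
    duplicates) of each test instance's k nearest TDS instances."""
    keep = set()
    for t in test:
        ranked = sorted(range(len(tds)), key=lambda i: euclidean(tds[i], t))
        keep.update(ranked[:k])
    return [tds[i] for i in sorted(keep)]
```

Note that the selection is entirely test-guided, which is why the filter must be re-run whenever new test data arrives.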
The core idea of this filter is to use the Test instances to guide the selection of the Filtered TDS. The Burak filter must be repeated each time new test data arrives. But is that the right way to do the filtering? Is there any advantage to learning and caching some structures in the training data before reflecting over the test data? The following speculation argues that such an advantage might exist, in the form of the Peters filter:

• When one company wants to use data from many other companies, the expected case is that the Test data (from this company) is much smaller than the TDS (the training data set) from all the other companies;
• When Test is smaller than TDS, then there should be more information about defects in the TDS than in Test;
• Hence, when selecting relevant data, it might be better to guide that search using the structure of the TDS training data rather than the Test data.

978-1-4673-2936-1/13 © 2013 IEEE. MSR 2013, San Francisco, CA, USA. Accepted for publication by IEEE.
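One concrete (and here purely hypothetical) reading of "guide the search using the structure of the TDS" is to let each training instance nominate its nearest test instance, then keep, per test instance, the closest training instance that nominated it. The sketch below illustrates only the training-guided direction of the conjecture; it is an assumption for illustration, not necessarily the Peters filter's exact algorithm.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def training_guided_filter(tds, test):
    """Each TDS instance 'nominates' its nearest Test instance; for
    each Test instance, keep only the closest nominating TDS instance.
    Selection is driven by the (larger) training set, so the learned
    structure could be cached between test sets."""
    nominees = {}  # test index -> list of (distance, tds index)
    for i, row in enumerate(tds):
        j = min(range(len(test)), key=lambda j: euclidean(row, test[j]))
        nominees.setdefault(j, []).append((euclidean(row, test[j]), i))
    keep = sorted(min(group)[1] for group in nominees.values())
    return [tds[i] for i in keep]
```

Unlike the test-guided Burak filter, here the many training instances compete for the few test instances, exploiting the expected size asymmetry noted in the bullets above.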