How to Find Relevant Data for Effort Estimation?

Ekrem Kocaguneli, Tim Menzies
Lane Department of Computer Science and Electrical Engineering
West Virginia University, Morgantown, USA
ekocagun@mix.wvu.edu, tim@menzies.us

ABSTRACT
Background: Building effort estimators requires training data. How can we find that data? It is tempting to cross the boundaries of development type, location, language, application and hardware to use existing datasets of other organizations. However, prior results caution that using such cross data may not be useful.
Aim: We test two conjectures: (1) instance selection can automatically prune irrelevant instances and (2) retrieval from the remaining examples is useful for effort estimation, regardless of their source.
Method: We selected 8 cross-within divisions (21 pairs of within-cross subsets) out of 19 datasets and evaluated these divisions under different analogy-based estimation (ABE) methods.
Results: Between the within and cross experiments, there were few statistically significant differences in (i) the performance of effort estimators; or (ii) the number of instances retrieved for estimation.
Conclusion: For the purposes of effort estimation, there is little practical difference between cross and within data. After applying instance selection, the remaining examples (be they from within or from cross source divisions) can be used for effort estimation.

Categories and Subject Descriptors
H.4 [Software Cost Estimation]: k-NN; D.2.8 [Software Engineering]: Cost - within resource, cross resource

1. INTRODUCTION
A recurring problem in effort estimation is finding training data that is relevant to some local problem. When we cannot find enough local training data, it is tempting to try to import data from other sources. However, it is not clear that this approach is useful: many studies report that using imported data degrades estimation efficacy, perhaps because the imported data is not relevant to the local context (e.g.
see the Kitchenham et al. [14] and Zimmermann et al. [31] studies discussed later in this paper).

In this paper, we offer one solution to the problem of importing relevant data from other sources in order to make estimates about local models. Our solution is based on a fresh look at what it means to say that examples are local or imported. Many publications [2, 6–8, 19, 28, 29], including several of our own [21, 22], either explicitly or tacitly assume “locality(1)”; i.e. clumps of similar projects can be discovered using a single feature. We say that data divided into subsets according to locality(1) can be used for within or cross effort modeling: within studies are localized to one subset; a cross study trains on some subsets and tests on others.

As examples of within studies, some authors claim that, for projects in a specific organization, software effort models work best when calibrated with local data collected within that same organization. Proponents of such a within source approach assume that it is best to retrieve training data for examples divided according to:
• The project type being developed (e.g. embedded);
• The development centers of the different developers;
• The development language of the projects;
• The application type (management information system; guidance, navigation, and control; etc.);
• The targeted hardware platform;
• Whether the projects are developed in-house or outsourced.

If locality(1) were true, then any lessons learned from one organization might never apply to another. For example, we might not be able to transfer lessons learned about effort estimation from one company called (say) “Boeing” to another called “Lockheed-Martin”. If so, then our ability to make general conclusions about software engineering (SE) would be confined to small, highly specialized sub-groups (e.g. just one company).

The opposite of locality(1) is “locality(N)”; i.e.
the assumption that effort estimation data forms a complex multi-dimensional space that can only be usefully divided using multiple features. If true, this would be very good news, since it would mean that relevant data for effort estimation does not come just from small sub-groups within one organization. Rather, useful data could be collected from many projects, including cross sources. Continuing the above example, this would mean that some of the data from Boeing might apply to some of the projects at Lockheed-Martin.

Note that, if locality(N) were true, then this would simplify effort modeling for new projects: just search other contexts for the right data for the new project. Also, it could lead to conclusions about SE that are general to many development contexts.

This paper argues for locality(N) using two predictions that would support locality(N) and would contradict locality(1):