JOURNAL OF IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. Y, SOMEMONTH 201Z

Learning Project Management Decisions: A Case Study with Case-Based Reasoning Versus Data Farming

Tim Menzies, Member, IEEE, Adam Brady, Jacky Keung, Member, IEEE, Jairus Hihn, Steven Williams, Oussama El-Rawas, Phillip Green, Barry Boehm

Abstract—
BACKGROUND: Given information on just a few prior projects, how can we learn the best and fewest changes for current projects?
AIM: To conduct a case study comparing two ways to recommend project changes: (1) data farmers use Monte Carlo sampling to survey and summarize the space of possible outcomes; (2) case-based reasoners (CBR) explore the neighborhood around test instances.
METHOD: We applied a state-of-the-art data farmer (SEESAW) and a CBR tool (W2) to software project data.
RESULTS: CBR with W2 was more effective than SEESAW's data farming at learning the best and fewest project changes, and it reduced runtime, effort, and defects. Further, CBR with W2 was comparatively easier to build, maintain, and apply in novel domains, especially on noisy data sets.
CONCLUSION: Use CBR tools like W2 when data is scarce or noisy, or when project data cannot be expressed in the form required by a data farmer.
FUTURE WORK: This study applied our own CBR tool to several small data sets. Future work could apply other CBR tools and data farmers to other data (perhaps to explore other goals such as, say, minimizing maintenance effort).

Index Terms—Search-based software engineering, case-based reasoning, data farming, COCOMO

1 INTRODUCTION

In the age of Big Data and cloud computing, it is tempting to tackle problems using:
• A data-intensive Google-style collection of gigabytes of data; or, when that data is missing,
• A CPU-intensive data farming analysis; i.e., Monte Carlo sampling [1] to survey and summarize the space of possible outcomes (for details on data farming, see §2).
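To make the data farming idea concrete, the following is a minimal sketch of Monte Carlo sampling over a space of project decisions. The attributes, their ranges, and the scoring function are hypothetical stand-ins chosen for illustration; they are not the COCOMO-based models that SEESAW actually uses.

```python
import random

# Hypothetical project decisions and their value ranges (illustrative only).
ATTRIBUTES = {
    "team_experience": (1, 5),
    "process_maturity": (1, 5),
    "schedule_pressure": (1, 5),
}

def toy_score(project):
    """Hypothetical effort model: lower scores are better."""
    return (10 - project["team_experience"]
               - project["process_maturity"]
               + project["schedule_pressure"])

def data_farm(n_samples=10000, seed=1):
    """Monte Carlo sampling: survey the space of possible projects,
    then summarize it by the best configuration found."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        project = {k: rng.randint(lo, hi)
                   for k, (lo, hi) in ATTRIBUTES.items()}
        score = toy_score(project)
        if best is None or score < best[0]:
            best = (score, project)
    return best

score, project = data_farm()
```

Note that the sampler needs many evaluations to cover even this tiny three-attribute space; real process models have far more attributes, which is one reason data farming is CPU-intensive.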
For example, consider a software project manager trying to:
• Reduce defects in the delivered software;
• Reduce project development effort.
How can a manager find and assess different ways to address these goals? It may not be possible to answer this question via data-intensive methods: such data is inherently hard to access. For example, as discussed in §2.2, we may never have access to large amounts of software process data. As to the CPU-intensive approaches, we have been exploring data farming for a decade [2] and, more recently, cloud computing [3], [4]. Experience shows that CPU-intensive methods may not be appropriate for all kinds of problems and may introduce spurious correlations in certain situations. In this paper, we document that experience.

• Tim Menzies (corresponding author), Adam Brady, Phillip Green, and Oussama El-Rawas are with the Lane Department of Computer Science and Electrical Engineering, West Virginia University. E-mail: tim@menzies.us, adam.m.brady@gmail.com, deathcheese@yahoo.com, orawas@gmail.com.
• Jacky Keung is with the Department of Computer Science, The City University of Hong Kong, Hong Kong SAR. E-mail: jacky.keung@cityu.edu.hk.
• Steven Williams is with the School of Informatics and Computing, Indiana University, Bloomington. E-mail: stevencwilliams@gmail.com.
• Jairus Hihn is with Caltech's Jet Propulsion Laboratory. E-mail: jairus.hihn@jpl.nasa.gov.
• Barry Boehm is with the University of Southern California. E-mail: boehm@sunset.usc.edu.
This research was conducted at West Virginia University, University of Southern California, and NASA Jet Propulsion Laboratory under a NASA sub-contract. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not constitute or imply its endorsement by the United States Government. This research was funded in part by NSF CISE project #0810879.
The experiments of this paper benchmark our SEESAW data farming tool, proposed in [5]–[12], against a lightweight case-based reasoner (CBR) called W2 [13], [14]. We find that if we over-analyze scarce data (such as the software process data of §2.2), then we run the risk of drawing conclusions with insufficient supporting data. Such conclusions will perform poorly on future examples. Our experience shows that the SEESAW data farming tool suffers from many "optimization failures": if some test set is treated with SEESAW's recommendations, then some aspect of that treated data actually gets worse. The W2 CBR tool, on the other hand, suffers from far fewer such failures. Based on those experiments, this paper concludes that, when reasoning about changes to software projects:
1) Use data farming in data-rich domains (e.g., when reasoning about thousands of inspection reports on millions of lines of code [15]), when the data is not noisy, and when the software project data can be expressed in the same form as the model inputs;
2) Otherwise, use CBR methods such as our W2 tool.
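The neighborhood-based style of reasoning used by CBR tools can be sketched as follows. This is an illustrative toy, not W2's actual algorithm: the cases, features, Euclidean distance measure, and the rule "recommend the deltas toward the lowest-effort neighbor" are all assumptions made for the example.

```python
import math

# Hypothetical case base: (feature vector, observed effort) for past projects.
CASES = [
    ((3, 4, 2), 120.0),
    ((1, 2, 5), 400.0),
    ((4, 5, 1), 80.0),
    ((2, 3, 3), 210.0),
]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_cases(test, k=2):
    """Explore the neighborhood around a test instance:
    return the k most similar past projects."""
    return sorted(CASES, key=lambda case: distance(case[0], test))[:k]

def recommend(test, k=2):
    """Recommend changes by contrasting the test instance with its
    best (lowest-effort) neighbor, as a dict of feature deltas."""
    best_features, _ = min(nearest_cases(test, k), key=lambda c: c[1])
    return {i: b - t
            for i, (t, b) in enumerate(zip(test, best_features))
            if b != t}
```

Because recommendations are read directly off a handful of similar past projects, this style of reasoning degrades more gracefully on scarce data than a sampler that must trust a model over the whole decision space, which matches the trade-off reported above.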