JOURNAL OF ???, VOL. 6, NO. 1, JANUARY 2007

Software Effort Estimation and Conclusion Stability

Tim Menzies, Member, IEEE, Omid Jalali, Jairus Hihn, Dan Baker, and Karen Lum

Abstract— This paper revisits the conclusion instability problem identified by Kitchenham, Foss, Myrtveit et al.; i.e., conclusions regarding which software effort estimation method is "best" are highly contingent on (1) the evaluation criteria and (2) the subset of the data used in the evaluation. Using non-parametric methods (the Mann-Whitney U test), we show how to avoid conclusion instability. This paper reports a study that ranked 158 effort estimation methods via three different evaluation criteria and hundreds of different randomly selected subsets. The same four methods were ranked higher than the other 154 methods regardless of which evaluation criteria or data subset was applied. Hence, we recommend non-parametric evaluation to evaluate and prune effort estimation methods. More specifically, when learning effort estimators from COCOMO-style data, we find that manual stratification defeats many complex algorithmic methods. However, we can do better than manual stratification by augmenting Boehm's local calibration method with simple linear-time row and column pruning pre-processors. We also advise against model trees, linear regression, exponential-time feature subset selection, and (unless the data is sparse) methods that average the estimates of nearest neighbors. To the best of our knowledge, this report is the first to offer stable conclusions regarding effort estimation across such a wide range of methods.

Index Terms— COCOMO, effort estimation, data mining, evaluation, Mann-Whitney U test, non-parametric tests.

I. INTRODUCTION

Software effort estimates are often wrong. Initial estimates may be incorrect by a factor of four [1] or even more [2]. As a result, the allocated funds may be inadequate to develop the required project.
In the worst case, over-running projects are canceled, wasting the entire development effort. For example, in 2003, NASA canceled the CLCS system after spending hundreds of millions of dollars on software development. The project was canceled after the initial estimate of $206 million was increased to between $488 million and $533 million [3]. On cancellation, approximately 400 developers lost their jobs [3].

While the need for better estimates is clear, there exists a very large number of effort estimation methods [4], [5] and no good criteria for selecting between them. Few studies empirically compare all these techniques. What is more usual are narrowly focused studies (e.g. [2], [6], [7], [8]) that test, say, linear regression models in different environments.

Kitchenham et al. [9], Foss et al. [10] and Myrtveit et al. [11] (hereafter, KFM) have doubted the practicality of comparatively assessing L different learners processing D data sets. The results of such a comparison, they argue, vary according to the sub-sample of the data being processed and the applied evaluation criteria. Foss et al. comment that it

. . . is futile to search for the Holy Grail: a single, simple-to-use, universal goodness-of-fit kind of metric, which can be applied with ease to compare (different methods). [10, p993]

Tim Menzies, Omid Jalali, and Dan Baker are with the Lane Department of Computer Science and Electrical Engineering, West Virginia University, USA: tim@menzies.us, ojalali@mix.wvu.edu, danielryanbaker@gmail.com. Jairus Hihn and Karen Lum are with NASA's Jet Propulsion Laboratory: jhihn@mail3.jpl.nasa.gov, karen.t.lum@jpl.nasa.gov.

The research described in this paper was carried out at West Virginia University and the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the US National Aeronautics and Space Administration.
Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not constitute or imply its endorsement by the US Government. Download: http://menzies.us/pdf/07stability.pdf. Manuscript received July 31, 2007; revised XXX, XXXX.

Methodologically, KFM's conclusion instability is highly problematic. Unless we can rank methods and prune inferior methods, we will soon be overwhelmed by a growing number of (possibly useless) effort estimation methods. New open source data mining toolkits are appearing with increasing frequency, such as the R project1, Orange2, and WEKA [12]. Such tools tempt researchers to over-elaborate their effort estimation tools. For example, our own COSEEKMO tool [13] takes nearly a day to run its 158 methods. Much of that execution is wasted since, as shown below, 154 of those methods are superfluous.

The rest of this paper presents the ranking and pruning results that culled 154 COSEEKMO methods. Rather than seeking the best method, we will seek a small set of methods that perform better than the rest. COSEEKMO contains such a best set of four methods. Further, in a result that is a counter-example to the KFM studies, the same set of four methods is best in studies using three different evaluation criteria and hundreds of different randomly selected subsets.

We explain our differences from the KFM study as follows. The root cause of conclusion instability is a very small number of estimates with very large errors. If these outliers fall into some of the subsets, then those subsets will have dramatically different performance results; i.e. they will exhibit conclusion instability. Non-parametric statistics such as the U test proposed by Mann and Whitney [14] mitigate the outlier problem. The U test uses ranks, not precise numeric values.
For example, if treatment A generates N1 = 5 values {5, 7, 2, 0, 4} and treatment B generates N2 = 6 values {4, 8, 2, 3, 6, 7}, then these sort as follows:

  Samples  A  A    B    B  A    B    A  B  A    B    B
  Values   0  2    2    3  4    4    5  6  7    7    8

On ranking, averages are used when values are the same:

  Samples  A  A    B    B  A    B    A  B  A    B    B
  Values   0  2    2    3  4    4    5  6  7    7    8
  Ranks    1  2.5  2.5  4  5.5  5.5  7  8  9.5  9.5  11

1 http://www.r-project.org/
2 http://www.ailab.si/orange/