Software cost prediction with predefined interval estimates

Stamatia Bibi, Ioannis Stamelos, Lefteris Angelis

Abstract

Defining the productivity required to complete a software development project successfully, within time and budget constraints, is a reasoning problem that should be modelled under uncertainty. One way of achieving this is to estimate an interval accompanied by a probability instead of a single value. In this paper we compare traditional methods that focus on point estimates, methods that produce both point and interval estimates, and methods that produce only predefined interval estimates. In the case of predefined intervals, software cost estimation becomes a classification problem. All of the above methods are applied to two data sets, namely the COCOMO81 data set and the Maxwell data set. In addition, the ability of classification techniques to resolve one classification problem in cost estimation, namely determining the software development mode from project attributes, is assessed and compared to reported results.

1. Introduction

Estimating the cost of a software development project is one of the crucial aspects of project planning and management, but it remains an open issue because of the diversity of cost factors, their unclear contribution to productivity, the high degree of uncertainty and the lack of information in the early stages of software development. For these reasons, low accuracy and unsuccessful estimations seem to have been inevitable so far. Software cost estimation actually involves the estimation of the productivity or effort needed to complete a project. Most of the proposed methods produce point estimates of these attributes, along with prediction intervals, in an attempt to account for the uncertainty or risk associated with the estimation process.
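As an illustration of the predefined-interval idea stated in the abstract, the following minimal Python sketch turns a numeric productivity attribute into interval classes, so that a classifier can predict an interval instead of a single value. The equal-frequency cut points, the function names and the productivity figures are assumptions made for this example only; they are not taken from the COCOMO81 or Maxwell data sets, and this is not necessarily the binning procedure used in the study.

```python
from bisect import bisect_right
from collections import Counter


def equal_frequency_edges(values, n_intervals):
    """Return the n_intervals - 1 inner cut points that split the sorted
    values into groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_intervals] for k in range(1, n_intervals)]


def interval_label(value, edges):
    """Map a value to the index of its interval (0 = lowest productivity).
    A value equal to a cut point falls into the upper interval."""
    return bisect_right(edges, value)


# Hypothetical productivity figures, invented for illustration.
productivity = [0.8, 1.1, 1.5, 2.0, 2.4, 3.1, 3.9, 4.6, 5.2, 7.5, 9.0, 12.3]

# Predefine three intervals before any estimate is generated.
edges = equal_frequency_edges(productivity, 3)

# Each project now carries a class label instead of a numeric target.
labels = [interval_label(p, edges) for p in productivity]

# Relative frequency of each interval in the training data.
class_sizes = Counter(labels)
class_probs = {c: class_sizes[c] / len(labels) for c in sorted(class_sizes)}
```

With these hypothetical figures the three intervals each receive four projects. Fixing the cut points in advance is what keeps the interval widths under the analyst's control; the label list can then serve as the target attribute for any classification technique.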
However, because software cost data sets are small (counting tens of projects in most cases), estimate intervals are often too large to be useful for practical cost estimation. Therefore, researchers tend to neglect estimate intervals and focus on the point estimates. This is at odds with the fact that the development of a software artifact is a human-driven procedure in which unexpected problems may arise. For a more realistic approach, it is necessary to consider both uncertainty and risk, weighing the chance of events occurring and the impact they might have. It is useful to reflect the level of uncertainty in the estimate, and interval estimates are a step in that direction.

Usually, interval estimates are created during the estimation process by first making a point estimate and then assessing prediction intervals. However, it is also possible to predefine the productivity intervals before the estimate is generated. This can be done in order to control the estimation procedure, to distribute the projects in the training data set as uniformly as possible among the productivity intervals, and to ensure that the estimate intervals will not be too large.

The aim of this study is to identify techniques that are capable of producing predefined intervals and to provide some evidence of their prediction accuracy, comparing them also to techniques that have traditionally focused on point estimates. Classification and Regression Trees (CART) is a technique that may produce both point and interval estimates. Two techniques that produce interval estimates only are Bayesian Belief Networks (BBN) and Association Rules (AR). Known techniques focusing on point estimates are Ordinary Least Squares (OLS), Forward Pass Residual Analysis (FPRA) and Analogy Based Estimation (ABE). Also the comparison between learning-oriented techniques (OLS, FPRA), machine learning techniques (CART, BBN, AR) and expertise-based techniques