Multivariate Logistic Regression Prediction of Fault-Proneness in Software Modules

Goran Mauša *, Tihana Galinac Grbac * and Bojana Dalbelo Bašić **
* Faculty of Engineering, University of Rijeka, Rijeka, Croatia
** Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
goran.mausa@riteh.hr, tihana.galinac@riteh.hr, bojana.dalbelo@fer.hr

Abstract - This paper explores additional features provided by stepwise logistic regression that could further improve the performance of a fault prediction model. Three models were used to predict fault-proneness in a NASA PROMISE data set and were compared in terms of accuracy, sensitivity and false alarm rate: one built with forward stepwise logistic regression, one with backward stepwise logistic regression and one with logistic regression without stepwise selection. Despite an obvious trade-off between sensitivity and false alarm rate, we conclude that backward stepwise regression gave the best model.

I. INTRODUCTION

Considering the complexity of modern software products and the numerous constraints on their production, it is not unusual for delivered software to contain faults. Software quality models have the task of automatically predicting fault-prone modules, enabling verification experts to concentrate on the problem areas of the system under development. That is why applying software quality models in the early stages of the software life cycle is essential: it provides an efficient defect removal procedure and results in more reliable software products [1, 2].

Fault prediction modeling is an important area of research in software engineering. With overall testing costs estimated at 50% of entire development costs, testing consumes a lot of resources. Ideally, testing should be exhaustive in order to be confident that most faults are detected.
In practice, however, due to many constraints, that is not possible, and every additional saving of resources is more than welcome. Fault prediction can be of assistance there, allowing software engineers to focus development activities on fault-prone code, improving software quality and making better use of resources.

Various techniques have been proposed for model building, including statistical methods, machine learning methods, parametric models and mixed algorithms. Among them, this paper explores the capabilities of logistic regression, which has been recognized as one of the best methods for fault-proneness prediction [1-5]. That is why multivariate logistic regression is used to build the fault-proneness prediction model.

In this paper we investigate the following research questions:
1) Can we choose a smaller subset of independent variables in the fault prediction model using logistic regression to obtain better results?
2) Which static code attributes used as independent variables influence the model prediction performance?

Using too many independent variables can have negative effects on the model's fault-proneness prediction, making the model more dependent on the data set currently in use and therefore less general [6]. Selecting the appropriate measures to be used in the model requires a strategy for minimizing the number of independent variables. This paper investigates the use of static code attributes as independent variables and of forward and backward stepwise selection for choosing among them. Model performance is evaluated using widely used measures: accuracy, sensitivity and false alarm rate [1, 2]. A defect prediction model should identify as many fault-prone modules as possible while avoiding false alarms [7]. A public domain NASA data set is used for building and testing the fault-proneness prediction model.
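
The three performance measures named above can be illustrated with a short sketch. It assumes the standard confusion-matrix definitions, which this section does not spell out: sensitivity as the share of faulty modules that are flagged, and false alarm rate as the share of fault-free modules that are flagged. The function name and the 1/0 label encoding are illustrative choices, not taken from the paper.

```python
def performance_measures(actual, predicted):
    """Return (accuracy, sensitivity, false_alarm_rate) for binary labels.

    Assumption: a truthy label marks a fault-prone module, a falsy one a
    fault-free module. The formulas are the usual confusion-matrix ones.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        a, p = bool(a), bool(p)
        if a and p:
            tp += 1          # faulty module, correctly flagged
        elif not a and p:
            fp += 1          # fault-free module, flagged (false alarm)
        elif not a and not p:
            tn += 1          # fault-free module, correctly passed
        else:
            fn += 1          # faulty module, missed

    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, sensitivity, false_alarm_rate


# Made-up labels for eight modules, for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
accuracy, sensitivity, false_alarm_rate = performance_measures(actual, predicted)
```

Under these definitions a model can raise sensitivity simply by flagging more modules, which also raises the false alarm rate, so the two measures have to be read together.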
The capabilities of logistic regression are tested on the CM1 data set, which contains parameters describing program code written in C.

The paper consists of the following sections: the first section describes the whole case study process, the second section examines the potential threats to validity, and the third section gives a conclusion based on the conducted research.

II. CASE STUDY

A. Data set

CM1 is a public data set acquired from the PROMISE (PRedictOr Models In Software Engineering) repository at http://promise.site.uottawa.ca/SERepository. The goal of the PROMISE repository is to encourage repeatable, verifiable, refutable, and/or improvable predictive models in software engineering [8]. The CM1 data set was created by the NASA Metrics Data Program and donated by Tim Menzies on December 2, 2004. The CM1 data is obtained from a spacecraft instrument written in C, containing approximately 506 modules. It is structured as a matrix of 498 rows and 22 columns, where rows represent different software modules and columns represent different static code attributes. All but the last column describe the complexity of the software code, and the last one indicates whether a defect was detected or not.

MIPRO 2012/CTI
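
The matrix layout described for CM1 (one row per module, the last column holding the defect flag) can be sketched as a simple split into independent variables and a dependent variable. This is a minimal illustration, not the authors' procedure: the comma-separated format, the "true"/"false" label spelling and the function name `split_modules` are all assumptions, and the two sample rows are made up and use three attributes instead of CM1's 21.

```python
import csv
import io


def split_modules(rows):
    """Split module rows into attribute vectors and defect labels.

    Assumption: every row lists numeric static code attributes followed
    by one textual defect flag in the last column.
    """
    features, labels = [], []
    for row in rows:
        *attributes, defect = row
        features.append([float(value) for value in attributes])
        # Hypothetical label encoding; adjust to the actual file contents.
        labels.append(1 if defect.strip().lower() in ("true", "yes", "1") else 0)
    return features, labels


# Tiny made-up sample standing in for the real data file.
sample = io.StringIO("1.1,1.4,1.4,false\n24,5,1,true\n")
X, y = split_modules(csv.reader(sample))
```

Here `X` holds one attribute vector per module and `y` the matching defect labels, which is the shape a logistic regression fit expects.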