Practical development of an Eclipse-based software fault prediction tool using Naive Bayes algorithm

Cagatay Catal a,⇑, Ugur Sevim a, Banu Diri b

a The Scientific and Technological Research Council of Turkey (TUBITAK) and the National Research Institute of Electronics and Cryptology (UEKAE), Information Technologies Institute, Kocaeli, Turkey
b Yildiz Technical University, Department of Computer Engineering, Istanbul, Turkey

Keywords: Machine learning; Naive Bayes; Eclipse technology; Software fault prediction

Abstract

Despite the amount of effort software engineers have been putting into developing fault prediction models, software fault prediction still poses great challenges. Research using machine learning and statistical techniques has been ongoing for 15 years, and yet we still have not had a breakthrough. Unfortunately, none of these prediction models has achieved widespread applicability in the software industry, due to a lack of software tools to automate the prediction process. Historical project data that includes software faults, together with a robust software fault prediction tool, can enable quality managers to focus on fault-prone modules and thus improve the testing process. We developed an Eclipse-based software fault prediction tool for Java programs to simplify the fault prediction process. We also integrated a machine learning algorithm called Naive Bayes into the plug-in because of its proven high performance for this problem. This article presents a practical view of the software fault prediction problem, and it shows how we managed to combine software metrics with software fault data to apply the Naive Bayes technique inside an open source platform.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Over the last 15 years, software fault prediction models have received a lot of attention from software engineering researchers and machine learning experts.
However, a robust fault prediction model alone is not enough in today’s competitive software industry. There is a great need for software tools that make it easier for software quality professionals or project managers to predict faults before they occur. Building and applying a fault prediction model within a software company is time-consuming, detailed, meticulous work, and most commercial projects do not have enough resources to carry out this activity (Ostrand & Weyuker, 2006). On the other hand, the benefits of this Quality Assurance (QA) activity are impressive. By using such models, one can identify candidate modules for refactoring, improve the software testing process, select the best design from design alternatives with class-level metrics, and reach a dependable software system (Catal & Diri, 2008). Hence, a tool that simplifies the prediction process is extremely useful to projects that might not be able to allocate the necessary resources for this QA activity (Ostrand & Weyuker, 2006).

Software fault prediction approaches use previous software metrics and fault data to predict the fault-prone modules for the next release of the software. If an error is reported for a module during system tests or in the field, that module’s fault data is marked as 1; otherwise it is marked as 0. For the prediction modeling, software metrics are used as independent variables and fault data (1 or 0) is used as the dependent variable. Therefore, we need a version control system (VCS) such as Subversion to store source code, a change management system (CMS) such as ClearQuest to record fault data, and a tool to collect product metrics (method-level or class-level) from source code. Parameters of the prediction model are calculated using previous software metrics and fault data.
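The modeling setup described above (metrics as independent variables, a 0/1 fault label as the dependent variable, parameters estimated from a previous release) can be illustrated with a small sketch. It is not the tool's implementation: it assumes a Gaussian likelihood for each metric, which is one common way to apply Naive Bayes to numeric attributes, and the class name `FaultNaiveBayesSketch`, the two metrics (lines of code and cyclomatic complexity), and all data values are hypothetical.

```java
/**
 * Minimal Gaussian Naive Bayes sketch for binary fault labels
 * (0 = not fault-prone, 1 = fault-prone). The Gaussian likelihood is an
 * assumption made for illustration, not taken from the paper.
 */
final class FaultNaiveBayesSketch {
    private final double[] prior = new double[2];
    private double[][] mean;      // mean[c][j]: mean of metric j within class c
    private double[][] variance;  // variance[c][j]: variance of metric j within class c

    /** Estimates class priors and per-class metric statistics from a previous release. */
    public void fit(double[][] metrics, int[] faulty) {
        int d = metrics[0].length;
        mean = new double[2][d];
        variance = new double[2][d];
        int[] count = new int[2];
        for (int i = 0; i < metrics.length; i++) {
            int c = faulty[i];
            count[c]++;
            for (int j = 0; j < d; j++) mean[c][j] += metrics[i][j];
        }
        for (int c = 0; c < 2; c++) {
            if (count[c] == 0) throw new IllegalArgumentException("both classes are required");
            prior[c] = (double) count[c] / metrics.length;
            for (int j = 0; j < d; j++) mean[c][j] /= count[c];
        }
        for (int i = 0; i < metrics.length; i++) {
            int c = faulty[i];
            for (int j = 0; j < d; j++) {
                double diff = metrics[i][j] - mean[c][j];
                variance[c][j] += diff * diff;
            }
        }
        for (int c = 0; c < 2; c++) {
            for (int j = 0; j < d; j++) {
                // A small constant keeps the variance strictly positive.
                variance[c][j] = variance[c][j] / count[c] + 1e-9;
            }
        }
    }

    /** Log of prior times the product of per-metric Gaussian densities for class c. */
    public double logScore(double[] x, int c) {
        double score = Math.log(prior[c]);
        for (int j = 0; j < x.length; j++) {
            double diff = x[j] - mean[c][j];
            score += -0.5 * Math.log(2 * Math.PI * variance[c][j])
                     - diff * diff / (2 * variance[c][j]);
        }
        return score;
    }

    /** Returns 1 if the module is predicted fault-prone, otherwise 0. */
    public int predict(double[] x) {
        return logScore(x, 1) > logScore(x, 0) ? 1 : 0;
    }

    public static void main(String[] args) {
        // Hypothetical previous-release data: {lines of code, cyclomatic complexity}.
        double[][] metrics = {
            {120, 4}, {90, 3}, {150, 5},      // modules with no reported fault
            {800, 30}, {650, 25}, {900, 35}   // modules with a reported fault
        };
        int[] faulty = {0, 0, 0, 1, 1, 1};

        FaultNaiveBayesSketch model = new FaultNaiveBayesSketch();
        model.fit(metrics, faulty);

        // Modules of the next release, described by the same two metrics.
        System.out.println(model.predict(new double[] {700, 28}));
        System.out.println(model.predict(new double[] {100, 3}));
    }
}
```

Working with log scores rather than raw probabilities avoids numerical underflow when many metrics are multiplied together; the independence assumption between metrics is what makes the per-metric sum valid.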
Expert Systems with Applications 38 (2011) 2347–2353
journal homepage: www.elsevier.com/locate/eswa
0957-4174/$ - see front matter. doi:10.1016/j.eswa.2010.08.022
⇑ Corresponding author. E-mail addresses: cagatay.catal@bte.mam.gov.tr (C. Catal), ugur.sevim@boun.edu.tr (U. Sevim), banu@ce.yildiz.edu.tr (B. Diri).

Different version control systems and change management systems may have different kinds of Application Programming Interfaces (APIs); therefore, we did not aim to support a specific type of VCS or CMS during the development of our software fault prediction tool. In addition, designing a one-size-fits-all tool would not be as easy as it seems. We decided to let the prediction tool calculate the software metrics from a source code directory instead of a VCS. Our prediction tool also does not read fault data directly from a CMS, because not every code change necessarily corresponds to a fault. With this strategy in mind, we let the user add the fault data for each module from an Eclipse editor. The following sections illustrate this easy-to-use operation with a figure. For this reason, our prediction tool is not dependent on any