THE INTERNATIONAL ARAB CONFERENCE ON INFORMATION TECHNOLOGY (ACIT2014)
University of Nizwa, Oman
December 9-11, 2014

Fault-Proneness of Open Source Systems: An Empirical Analysis

Mamdouh Alenezi
College of Computer & Information Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia

Shadi Banitaan
College of Engineering & Science, University of Detroit Mercy, Detroit, MI 48221, USA

Qasem Obeidat
Department of Software Engineering, Al-Zaytoonah University of Jordan, Amman 11377, Jordan

Abstract: Developing quality software is a complex task given the size and complexity of the software developed these days. Early prediction of software quality assists in optimizing testing resources. Many fault prediction models have been developed using several internal attributes and different machine learning techniques. However, the open-source community still lacks concise knowledge about which types of internal attributes affect software quality the most. In this work, an empirical investigation is conducted to explore the relationships between internal attributes of open-source systems and their fault-proneness. The results of the empirical analysis showed that when only nine internal attributes were selected, the accuracy of the fault prediction models did not decrease significantly. This indicates that only a subset of these internal attributes is worth collecting and investigating. By focusing on a small set of internal attributes, the quality assurance team can save time and resources while achieving highly accurate fault-proneness predictions.

Keywords: fault-proneness, open-source systems, internal attributes, quality.

1. Introduction

Even when the principles of software development methodologies are applied, a fault-free software system is very difficult to achieve. The maintenance phase of software projects is very challenging and costly. The resources spent on software maintenance are much greater than those spent on its development.
Consequently, any part of maintenance that can be automated will eventually save maintenance resources. One area where effort can lower maintenance costs is identifying the parts of the source code that are likely to contain faults and therefore require changes.

A considerable body of research has been devoted to describing particular quality metrics and building quality predictive models based on internal attributes. Fault prediction models are usually built using software metrics and previously collected fault data. These models are then used to guide decision-making in the course of development. Fault-proneness is the most frequently investigated dependent variable [1]. Predicting the fault-proneness of classes helps to focus validation and verification effort, so that more faults are found for the same effort. When a class is predicted as likely to be faulty, effort can be invested in testing and inspecting that class. Fault prediction directs developers to carefully examine and test the files or classes that have a high probability of being defective. Focusing effort on faulty classes helps in managing and using the resources of the software project more efficiently. This makes the maintenance phase easier for both the customers and the project owners.

Software fault prediction models depend on the information available in software metrics. The quality of the software metrics data plays a significant role in building accurate prediction models. Selecting a subset of the software metrics is an essential part of the model building process. Focusing on a subset of these metrics saves the time needed to collect and manage them. In addition, using a reduced metrics set to build predictive models leads to better classification speed.
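The idea of building a fault prediction model from class-level metrics, and then checking whether a reduced metric set predicts as well as the full set, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the metric names (size, coupling, comment_ratio), the eight invented records, the pre-scaling of values to [0, 1], and the use of a 1-nearest-neighbour classifier with leave-one-out evaluation. It is not the classifier, data, or procedure used in this study.

```python
import math

# Hypothetical class-level records: (metric values, is_faulty).
# All numbers are invented and assumed to be pre-scaled to [0, 1].
CLASSES = [
    ({"size": 0.10, "coupling": 0.20, "comment_ratio": 0.50}, False),
    ({"size": 0.20, "coupling": 0.10, "comment_ratio": 0.90}, False),
    ({"size": 0.15, "coupling": 0.30, "comment_ratio": 0.10}, False),
    ({"size": 0.30, "coupling": 0.25, "comment_ratio": 0.60}, False),
    ({"size": 0.80, "coupling": 0.70, "comment_ratio": 0.40}, True),
    ({"size": 0.90, "coupling": 0.85, "comment_ratio": 0.80}, True),
    ({"size": 0.70, "coupling": 0.90, "comment_ratio": 0.20}, True),
    ({"size": 0.85, "coupling": 0.60, "comment_ratio": 0.70}, True),
]


def distance(a, b, metrics):
    """Euclidean distance between two records over the chosen metrics."""
    return math.sqrt(sum((a[m] - b[m]) ** 2 for m in metrics))


def predict(train, sample, metrics):
    """1-nearest-neighbour: return the label of the closest training record."""
    nearest = min(train, key=lambda rec: distance(rec[0], sample, metrics))
    return nearest[1]


def loo_accuracy(data, metrics):
    """Leave-one-out accuracy when only the given metrics are used."""
    hits = 0
    for i, (sample, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # hold out record i
        hits += predict(train, sample, metrics) == label
    return hits / len(data)


ALL_METRICS = ["size", "coupling", "comment_ratio"]
REDUCED_METRICS = ["size", "coupling"]

if __name__ == "__main__":
    print("all metrics:    ", loo_accuracy(CLASSES, ALL_METRICS))
    print("reduced metrics:", loo_accuracy(CLASSES, REDUCED_METRICS))
```

On this toy data the two metric sets classify every held-out class correctly, which only shows the shape of the comparison; it says nothing about real systems, where the accuracy of full and reduced sets has to be measured on collected fault data.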
In this study, we investigate the association between internal quality attributes (source code metrics) and the fault-proneness of open source projects.

The remainder of this paper is organized as follows. Section 2 presents the methodology used. The experimental evaluation is given in Section 3. Some threats to validity are presented in Section 4. Section 5 discusses related work. Conclusions of the research are presented in Section 6.

2. Methodology

To select a subset of software metrics that is sufficient to predict faulty classes, this study investigates twenty internal attributes of eight open source software systems. This study compares four different classification techniques. It also applies a feature selection technique to find the subset of metrics that is sufficient to predict faulty classes in open-source software projects.

2.1 Feature Selection

Hall and Holmes [2] categorized feature selection algorithms into (1) algorithms that evaluate individual attributes and (2) algorithms that evaluate subsets of attributes. The first category identifies which individual metrics can serve as discriminatory attributes for indicating an external quality. The second category selects the subset of features that best identifies the class label. In this study, the second category is selected, since the goal is to identify which subsets of metrics are better at identifying faulty classes in open source software systems.

Feature ranking algorithms evaluate attributes individually based on a certain measure and order them accordingly. However, some attributes may be less useful by themselves yet make a substantial contribution when combined with other attributes. Feature subset selection