Using Association Rules to Identify Similarities between Software Datasets

Saba Anwar, SSE, LUMS, Lahore, Pakistan, 10030040@lums.edu.com
Zeeshan Ali Rana, SSE, LUMS, Lahore, Pakistan, zeeshanr@lums.edu.com
Shafay Shamail, SSE, LUMS, Lahore, Pakistan, sshamail@lums.edu.com
Mian M. Awais, SSE, LUMS, Lahore, Pakistan, awais@lums.edu.com

Abstract—A number of V&V datasets are publicly available. These datasets contain software measurements and defectiveness information for software modules. To facilitate V&V, numerous defect prediction studies have used these datasets and have detected defective modules effectively. Software developers and managers can benefit from the existing studies to avoid analogous defects and mistakes if they are able to find similarity between their software and the software represented by the public datasets. This paper identifies similar datasets by comparing the association patterns in them. The proposed approach mines association rules from each dataset and identifies the rules that overlap among the 100 strongest rules of each of the two datasets being compared. Afterwards, the average support and average confidence of the overlap are calculated to determine the strength of the similarity between the datasets. This study compares eight public datasets, and the results show that KC2 and PC2 have the highest similarity (83%), with 97% support and 100% confidence. Datasets with similar attributes and almost the same number of attributes have shown higher similarity than the other datasets.

I. INTRODUCTION

Similarity between two software systems can help in estimating the development effort and budget of one system using experience from the other. Similarly, defect patterns in one system can help avoid defects in a similar system. Usually, software project data is not publicly available, so different organizations working on similar projects cannot benefit from each other.
However, V&V datasets with software measurements and software defects are publicly available [1]. Organizations can find the similarity of their software data to these datasets, determine the defect occurrence patterns and perform testing activities accordingly.

Similarity between datasets can be determined using different methods such as association mining [2], [3], [4] and fuzzy logic [5], [6], [7]. Parthasarathy et al. [2] have found association rules using ECLAT [8] and have presented a similarity measure Sim(A, B) to find similarity between homogeneous datasets. This similarity is calculated from support counts and a parameter alpha that reflects variations in support count. Their experimental results have shown that the algorithm can adapt to time constraints, providing speedups of 5 to 7 with similarity estimates accurate to within 2%. Tao Li et al. [3] have proposed a similarity measure between basket datasets based on associations. The measure employs support counts in a formula inspired by information entropy. Dudek et al. [4] have presented new measures for comparing association rulesets in distributed mining. Azzeh et al. [5] and Idri et al. [6], [7] have proposed measures based on fuzzy logic to evaluate the similarity between two software projects when they cannot be described by numerical data and need linguistic values. Furthermore, process measures have been employed to find software project similarity [9]. Barreto et al. [9] have identified characteristics of the software process that can determine the similarity among software projects and have presented a measure that indicates the level of similarity among projects in order to improve the software project monitoring process.

There are certain issues with the existing work that need to be addressed. The association mining based approaches discussed above assume the datasets to be homogeneous, which is not always the case.
The PROMISE repository [1] contains datasets that use different measures but can still be compared to determine whether they represent similar software. The fuzzy approach discussed above is applicable when numeric software data is not available. There are very few datasets with process measures, which reduces the chances of benefiting from public datasets. A large proportion of the public datasets consists of software product measures. In order to increase their utility and benefit from the defect patterns in the datasets, an approach to find dataset similarity is required.

Similarity between two software datasets can be found by identifying similar patterns of co-occurrence of attribute values in the datasets. If certain attribute values co-occur similarly in two datasets, we say that the datasets behave similarly. In this paper we present a three-step Association Rule Mining (ARM) based strategy to find similarity between two datasets. In the first step we apply the Apriori algorithm [10] to generate association rules for each dataset, such that we have a ruleset for each dataset. In the second step we label the rules in the rulesets. In the last step we find similarity between the two rulesets by identifying overlapping rules. Afterwards, we find the support and confidence of the overlap.

The rest of the paper is organized as follows. Section II discusses our methodology, Section III presents the experimental results, Section IV gives an analysis of the results and Section V concludes the paper and provides future directions.

Authors' Version
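As an illustration of the last step of the strategy outlined in the Introduction, the following minimal Python sketch compares two rulesets that are assumed to have been mined already (for example with Apriori) and to have been labelled as sets of "attribute=value" strings. The `Rule` type, the function names, the ranking of rules by confidence and then support, and the use of the smaller ruleset size as the denominator of the similarity are all assumptions made for this example; this section does not specify them, so the sketch should not be read as the exact measure used in the study.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Rule(NamedTuple):
    """Hypothetical rule record: antecedent => consequent with its strength."""
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    support: float
    confidence: float


def strongest(rules: List[Rule], k: int = 100) -> List[Rule]:
    """Return the k strongest rules.

    Assumption: strength is ranked by confidence first, then by support.
    """
    ranked = sorted(rules, key=lambda r: (r.confidence, r.support), reverse=True)
    return ranked[:k]


def ruleset_similarity(rules_a: List[Rule], rules_b: List[Rule],
                       k: int = 100) -> Tuple[float, float, float]:
    """Return (similarity, average support, average confidence) of the overlap.

    Two rules overlap when their labelled antecedent and consequent match.
    Assumption: similarity is the number of overlapping rules divided by the
    size of the smaller of the two top-k rulesets.
    """
    top_a = {(r.antecedent, r.consequent): r for r in strongest(rules_a, k)}
    top_b = {(r.antecedent, r.consequent): r for r in strongest(rules_b, k)}
    shared = top_a.keys() & top_b.keys()
    if not shared:
        return 0.0, 0.0, 0.0

    similarity = len(shared) / min(len(top_a), len(top_b))
    # Average the strength of each shared rule over both datasets.
    matched = [top_a[key] for key in shared] + [top_b[key] for key in shared]
    avg_support = sum(r.support for r in matched) / len(matched)
    avg_confidence = sum(r.confidence for r in matched) / len(matched)
    return similarity, avg_support, avg_confidence


# Toy rulesets with made-up labels and values, for illustration only.
rules_a = [
    Rule(frozenset({"loc=high"}), frozenset({"defective=yes"}), 0.9, 1.0),
    Rule(frozenset({"v(g)=low"}), frozenset({"defective=no"}), 0.8, 0.9),
]
rules_b = [
    Rule(frozenset({"loc=high"}), frozenset({"defective=yes"}), 0.7, 0.8),
    Rule(frozenset({"branch=low"}), frozenset({"defective=no"}), 0.6, 0.9),
]
similarity, avg_support, avg_confidence = ruleset_similarity(rules_a, rules_b)
```

In this toy input one of the two rules is shared, so the sketch reports a similarity of 0.5, and the averages are taken over the shared rule's values in both rulesets. Matching on labelled attribute values rather than on raw attribute names is what would allow rulesets from datasets with different measures to be compared, provided the labelling step maps them to a common vocabulary.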