Munch: An efficient modularisation strategy to assess the degree of refactoring on sequential source code checkings
Mahir Arzoky, Stephen Swift and Allan Tucker
Department of Information Systems and Computing, Brunel University, Uxbridge, UK
{mahir.arzoky, stephen.swift, allan.tucker}@brunel.ac.uk
James Cain
Quantel Limited, Newbury, UK
james.cain@quantel.com
Abstract: Software module clustering is the process of automatically partitioning the structure of a system, using low-level dependencies in the source code, in order to improve that structure. A large number of studies have used the search-based software engineering approach to solve the software module clustering problem. This paper introduces the concept of seeding to modularise sequential versions of a software system's source code, in order to measure the degree of refactoring. We have developed a software clustering tool called Munch. We evaluated the efficiency of the modularisation by performing a set of experiments on the dataset. We initially experimented with a few fitness functions and, as a result, chose the one we believe to be the most suitable, EVMD, to test on our unique dataset. The results of the experiments provide evidence to support the seeding strategy.
Keywords: clustering; modularisation; refactoring; seeding; time series; fitness functions; EVM.
I. INTRODUCTION
As developers create increasingly sophisticated applications, software systems are growing in both complexity and size. Systems are composed of entities such as variables and classes, which in turn rely on and interact with each other in complex ways. Systems naturally continue to evolve, and as they evolve their structure becomes more complex and harder to track. Thus, software systems need to be maintained regularly in order to cope with constantly evolving requirements.
Maintenance and evolution of systems can be frustrating, because it is difficult for developers to keep a fixed understanding of a system’s structure when that structure changes during maintenance. This problem is fuelled by the lack of up-to-date documentation, which is at times non-existent. To add to the difficulty of undocumented code, many of the original developers are no longer available to assist with the development. Maintaining such systems is a challenging task.
Refactoring is one of the most common techniques used to transform software in order to improve its internal quality attributes [16] [18]. Refactoring is defined as a change made to a software system that improves the internal structure of the code while maintaining its external behaviour [6]. If applied correctly, refactoring can improve maintainability, enhance performance and simplify the structure of the code. Nonetheless, both managers and developers can be hesitant to use refactoring, because of the amount of effort needed to make even a slight change in the code and the risk of introducing new bugs. Hence, within the development of large software systems, there is significant value in being able to predict when refactoring occurs.
Available information in software engineering problems can be incomplete, vague and susceptible to change. As the modular structure of a software system tends to decay over time, it is important to modularise. Modularisation is the process of partitioning the structure of the software system into subsystems. Subsystems are clusters of source code resources with similar properties, combined to create a high-level attribute of the system. Modularisation also makes the problem at hand easier to understand, as it reduces the amount of data developers need to consider. According to Constantine and Yourdon [5], good modularisation of software systems leads to easier design, development, testing and maintenance.
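The idea of modularisation as a partition of source code entities into subsystems can be illustrated with a minimal sketch. The class names, the dependency list and both helper functions below are hypothetical and chosen only for illustration; the intra- and inter-cluster counts are a simple indicator of how well a partition groups related entities, not the fitness function used in this paper.

```python
from typing import Dict, List, Set, Tuple

Dependency = Tuple[str, str]  # (source entity, target entity)


def is_partition(modules: Set[str], clusters: List[Set[str]]) -> bool:
    """Return True if clusters are non-empty, disjoint and cover every module."""
    seen: Set[str] = set()
    for cluster in clusters:
        if not cluster or seen & cluster:
            return False
        seen |= cluster
    return seen == modules


def count_dependencies(
    clusters: List[Set[str]], dependencies: Set[Dependency]
) -> Tuple[int, int]:
    """Count dependencies inside a cluster and dependencies crossing clusters."""
    owner: Dict[str, int] = {
        module: index
        for index, cluster in enumerate(clusters)
        for module in cluster
    }
    intra = sum(1 for src, dst in dependencies if owner[src] == owner[dst])
    return intra, len(dependencies) - intra


# Hypothetical five-class system (names are assumptions for illustration).
modules = {"Parser", "Lexer", "Token", "Renderer", "Canvas"}
dependencies = {
    ("Parser", "Lexer"),
    ("Lexer", "Token"),
    ("Parser", "Token"),
    ("Renderer", "Canvas"),
    ("Renderer", "Parser"),
}
# One candidate modularisation: two subsystems.
clusters = [{"Parser", "Lexer", "Token"}, {"Renderer", "Canvas"}]

assert is_partition(modules, clusters)
intra, inter = count_dependencies(clusters, dependencies)
print(intra, inter)  # 4 dependencies stay inside a subsystem, 1 crosses
```

In this sketch a partition that keeps most dependencies inside subsystems and few across them would be regarded as the more cohesive grouping.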
Consequently, given the considerable interest in automated re-modularisation through search-based software engineering, fast and effective tools for automated software module clustering have been developed. Automated tools are used to generate useful information on system structure. These tools analyse the low-level dependencies in the source code and cluster them into a set of meaningful subsystems. It is important to choose a suitable level of granularity when clustering the system at hand. A range of software modularisation techniques [7] [8] [11] [13] have been studied. Search-based software engineering has been shown to be highly robust across a variety of search algorithms.
The input for modularisation is dependence information obtained from the source code of the system to be modularised. Mancoridis et al. [10] were the first to use a Module Dependency Graph (MDG) as a representation of the software module clustering problem. An MDG represents the structure of a software system by expressing the modules of the system as nodes and the dependence relationships between the modules as edges.
2011 Fourth International Conference on Software Testing, Verification and Validation Workshops, 978-0-7695-4345-1/11 $26.00 © 2011 IEEE, DOI 10.1109/ICSTW.2011.87
The primary purpose of the paper is to perform efficient modularisation on a time series of source code relationships