Component Clustering Based on Maximal Association

Kamran Sartipi, Kostas Kontogiannis
University of Waterloo
Dept. of Computer Science and Dept. of Electrical & Computer Engineering
Waterloo, ON. N2L 3G1, Canada
{ksartipi, kostas}@swen.uwaterloo.ca

This work was funded by IBM Canada Ltd. Laboratory - Center for Advanced Studies (Toronto) and the National Research Council of Canada.

Abstract

In this paper, we present a supervised clustering framework for recovering the architecture of a software system. The technique measures the association between the system components (such as files) in terms of data and control flow dependencies among the groups of highly related entities that are scattered throughout the components. Applying data mining techniques makes it possible to extract the maximum association among the groups of entities. This association is used as a measure of closeness among the system files in order to collect them into subsystems using an optimization clustering technique. A two-phase supervised clustering process is applied to incrementally generate the clusters and control the quality of the system decomposition. To address the complexity issues, the whole clustering space is decomposed into sub-spaces based on the association property. At each iteration, the sub-spaces are analyzed to determine the most eligible sub-space for the next cluster, and an optimization search then generates the new cluster.

1 Introduction

As software systems grow in size and complexity, software architecture recovery becomes increasingly important as a prelude to most software maintenance tasks. Approaches to architectural recovery usually employ a clustering technique based on the properties of coupling and cohesion, or on association, in order to partition the software system into cohesive subsystems.

In this paper, we present a supervised (user-assisted) clustering technique as a framework for architectural recovery, and define two similarity metrics based on the association property in highly related groups of system entities. In this context, the software system is represented as a graph in which the system entities are denoted as nodes and data/control dependencies are denoted as edges. Applying data mining techniques to this graph reveals the groups of highly related entities.

The level of dependency among the groups of system entities is the basis for extracting association values between system components such as files. In this respect, the component association is defined as the degree to which the entities in one component are related to the entities in another component. We define a new similarity measure between two system components in terms of the component association values and use it in the clustering algorithm.

The proposed supervised clustering technique iteratively generates the subsystems of components (files) using the similarity measure between the system components, based on a two-phase optimization clustering process. The search space for the whole set of clusters is decomposed into a number of sub-spaces to handle the search complexity. At each iteration of the clustering process, a selection algorithm is used to select a sub-space for the next cluster, and an optimization search algorithm is used to collect the highly similar components around the core component of the selected sub-space, known as the main-seed.

In a nutshell, this paper first provides a new similarity metric based on the maximal association property (maximum number of shared properties) between two groups of entities such as files; next, it discusses a supervised clustering technique for decomposing a large system of files into cohesive subsystems; and finally, it discusses a search space reduction technique to manage the search complexity.
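As a rough illustration of file-level association as described above, the following Python sketch counts the entities that two files both depend on and ranks files around a chosen seed file. It is not the similarity metric defined in this paper, nor the Alborz implementation: the file and entity names, the toy dependency list, and the functions `used_entities`, `shared_count`, and `closest_to_seed` are all hypothetical, and the data mining step and the optimization search are replaced here by a plain set intersection and a sort.

```python
from collections import defaultdict

# Hypothetical toy data: (file, entity defined in that file, entity it uses).
DEPENDENCIES = [
    ("parser.c", "parse_expr", "next_token"),
    ("parser.c", "parse_expr", "Token"),
    ("lexer.c", "next_token", "Token"),
    ("lexer.c", "next_token", "read_char"),
    ("io.c", "read_char", "Buffer"),
    ("eval.c", "eval_expr", "Token"),
    ("eval.c", "eval_expr", "next_token"),
]


def used_entities(dependencies):
    """Map each file to the set of entities that its own entities depend on."""
    uses = defaultdict(set)
    for file_name, _source, target in dependencies:
        uses[file_name].add(target)
    return dict(uses)


def shared_count(uses, file_a, file_b):
    """Simplified association (an assumption): entities both files depend on."""
    return len(uses[file_a] & uses[file_b])


def closest_to_seed(uses, seed, limit):
    """Rank the other files by shared_count with the seed, highest first.

    Ties are broken alphabetically so the result is deterministic.
    """
    others = [f for f in uses if f != seed]
    ranked = sorted(others, key=lambda f: (-shared_count(uses, seed, f), f))
    return ranked[:limit]


if __name__ == "__main__":
    uses = used_entities(DEPENDENCIES)
    print(closest_to_seed(uses, "parser.c", 2))
```

In this toy setting, a file that shares more used entities with the seed is treated as closer to it, which loosely mirrors the idea of collecting highly similar components around a main-seed.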
The results of experimenting with two systems are also presented.

We implemented a prototype reverse engineering tool, Alborz [15], to recover the architecture of a software system as cohesive components. Depending on the user exper-