Understanding Class Evolution in Object-Oriented Software

Zhenchang Xing and Eleni Stroulia
Computing Science Department, University of Alberta
Edmonton AB, T6G 2H1, Canada
{xing, stroulia}@cs.ualberta.ca

Abstract

In the context of object-oriented design, software systems model real-world entities abstractly represented in the system classes. As the system evolves through its lifecycle, its class design also evolves. Thus, understanding class evolution is essential in understanding the current design of the system and the rationale behind its evolution. In this paper, we describe a taxonomy of class-evolution profiles, a method for automatically categorizing a system's classes in one (or more) of eight types in the taxonomy, and a data-mining method for eliciting co-evolution relations among them. These methods rely on our UMLDiff algorithm that, given a sequence of UML class models of a system, surfaces the design-level changes over its lifecycle. The recovered knowledge about class evolution facilitates the overall understanding of the system class-design evolution and the identification of the specific classes that should be investigated in more detail towards improving the system-design qualities. We report on two case studies evaluating our approach.

1 Motivation and Background

The objective of reverse engineering is most often to enable software understanding in support of maintenance, feature enhancement and adaptation activities [5]. In object-oriented systems, classes model abstractions of real-world entities around which these systems are designed. Therefore, understanding the system classes, i.e., their internal structure and their role in the context of the system functionality and behavior, constitutes a crucial step towards understanding the overall system design for both maintenance and new development.

There have been several research efforts to date aiming at understanding systems at the class level. For example, Lanza et al.
[17] introduced the “class blueprint”, a visualization of the internal structure of system classes at a particular point in their lifecycle. The class blueprint distinguishes among different types of classes, such as classes with wide interfaces that offer many entry points to their functionalities, definers that reside at the top of a hierarchy, or specializers that are leaves of a hierarchy. However, such visualizations require a substantial interpretation effort on the part of their users and become fairly “unreadable” for large systems with numerous classes. Furthermore, given that most software development nowadays adopts an evolutionary lifecycle model, analyzing a single snapshot of a system’s classes offers only limited insight; a comparative analysis of a sequence of snapshots should be more valuable in understanding the system’s design rationale. For example, consider a software maintainer who wants to identify “hotspots”, i.e., areas of substantial evolutionary activity, over the lifespan of a software system. By comparing a set of subsequent versions, the maintainer may find that a few classes have been substantially changed in every new version, irrespective of which features were modified in that version. This evidence of a highly coupled design may focus the maintainer's examination on the source code of these classes, in order to determine the cause of the problem and to propose modifications to remedy it. Such evolutionary analysis was the objective of Demeyer et al. [7], who investigated the use of comparative analysis of software metrics for drawing inferences regarding the evolution of a system. However, the result of their analysis refers to the system as a whole and does not provide any insight regarding the evolution of individual classes or groups of classes.
Another, potentially more precise, source of evolutionary information could be documentation, either at the source-code level or at the change-log level of the version-management system used for the development of the software system. Unfortunately, more often than not, such documentation is sparse and inconsistent [5,13]. In our work on understanding class evolution in object-oriented systems, we have adopted class models of subsequent system snapshots (which may be released versions or simply snapshots checked out at regular time intervals) as the primary input of our method. These class models are easily obtainable, given the source code that resides in a version-management system and any of a variety of existing round-trip software-development tools [29,30], and they are, by their very nature, fairly accurate representations of the source [19]. The fundamental intuition underlying our method is that by
Proceedings of the 12th IEEE International Workshop on Program Comprehension (IWPC’04) 1092-8138/04 $ 20.00 © 2004 IEEE