Semantic Driven Program Analysis

Andrian Marcus
Department of Computer Science
Wayne State University
Detroit, MI 48202
313 577 5408
amarcus@wayne.edu

Abstract

The paper presents an approach to extract and analyze the semantic content (i.e., problem and solution domain semantics) of existing software systems to support program understanding and various software maintenance tasks, such as recovery of traceability links between documentation and source code, identification of abstract data types in legacy code, and identification of high-level concept clones in software. The semantic information is derived from the comments, documentation, and identifier names associated with the source code using information retrieval methods. The paper advocates the use of latent semantic indexing as the underlying support for the semantic driven analysis. The presented results are based on the author’s doctoral dissertation [12].

1 Introduction

The tasks of maintenance and reengineering of an existing software system require a great deal of effort to be spent on understanding the source code to determine the behavior, organization, and architecture of the software not reflected in documentation. Program comprehension is one of the most important software engineering activities. It is vital for learning programming and for debugging, reuse, documentation, verification, and maintenance. A software engineer adopts different types of strategies during comprehension, based on the maintenance task at hand. In particular, there are top-down strategies [3, 16, 20], bottom-up strategies [15, 19], and integrated strategies [21]. Each of these strategies results in the development of a mental model by the software engineer. The mental model is a representation of various aspects and dimensions of the software system and of the relationships among the system entities.
In order to define the mental model, the software engineer needs to gather information available from the source code and associated documentation. Different types of information (e.g., static, dynamic, source code, documentation, etc.) describe different features of the software system. There are at least two key aspects of the system that the user needs to understand: 1) what problem the software solves and 2) how the software reaches the solution.

Static analysis directly supports software comprehension. Various methods exist to perform static analysis of the software system. Most of the existing methods focus on the structural information embedded in the source code, derived mainly from the programming language syntax (e.g., control and data flow). This type of information helps the user understand how the software works. Software engineers must examine both the structural aspect of the source code and the description of the problem domain (e.g., comments, documentation, and variable names) to extract the information needed to fully understand any part of the system. We describe the latter type of information as the semantic information of the software system.

Existing research efforts dealing with semantic information are focused on applying knowledge based, algorithmic, and transformational approaches to the domain of software. Few of these methods scale up well, and they are typically very labor intensive, time consuming, costly, and often impractical for large-scale software. Work has been done on the extraction of semantic information from source code, internal comments, and file names [1, 2, 5, 7, 18]. An inherent problem with many of these approaches is that they construct a specific domain model and vocabulary, which restricts the scope and flexibility of their solutions. With this in mind, we are proposing an approach that adopts a less accurate, but much cheaper and more flexible method for extracting and analyzing the semantic information.
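To make the notion of semantic information concrete, the following is a minimal sketch, not the implementation used in this work, of how terms can be drawn from comments and identifier names and then compared across source files. The stop list, the sample files, and the function names (`split_identifier`, `extract_terms`, `cosine`) are hypothetical and chosen only for illustration. Plain term-frequency vectors with cosine similarity stand in here for latent semantic indexing, which would additionally project these vectors into a reduced space obtained by a truncated singular value decomposition.

```python
import math
import re
from collections import Counter

# Hypothetical stop list (assumption): a real one would be larger and
# tuned to the programming language and the natural language in use.
STOP_WORDS = {"the", "of", "to", "and", "in", "for", "is", "from",
              "int", "char", "void", "return"}


def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase words."""
    parts = []
    for chunk in name.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts]


def extract_terms(source_text):
    """Count the words found in the comments and identifiers of a text."""
    terms = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_text):
        for word in split_identifier(token):
            if len(word) > 1 and word not in STOP_WORDS:
                terms.append(word)
    return Counter(terms)


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative "documents": one declaration with its comment per file.
documents = {
    "list_insert.c": "/* insert a node into the linked list */ "
                     "void listInsert(Node *head, int value);",
    "list_remove.c": "/* remove a node from the linked list */ "
                     "void list_remove(Node *head, int value);",
    "parse_args.c": "/* parse command line options */ "
                    "int parseArgs(int argc, char **argv);",
}
vectors = {name: extract_terms(text) for name, text in documents.items()}

sim_lists = cosine(vectors["list_insert.c"], vectors["list_remove.c"])
sim_other = cosine(vectors["list_insert.c"], vectors["parse_args.c"])
print(f"insert vs remove: {sim_lists:.2f}")
print(f"insert vs parse:  {sim_other:.2f}")
```

In this toy corpus the two list-manipulation files share most of their vocabulary, while the argument parser shares none, so the first pair scores higher. No domain model or hand-built vocabulary is required, which is the sense in which such a method is cheaper and more flexible, though less accurate, than knowledge-based approaches.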
Specifically, we propose the use of an information retrieval method, Latent Semantic Indexing (LSI), to extract and analyze the semantic information from the source code and documentation of a software system. In addition, the novelty of this approach is that it allows the combining of existing and

Proceedings of the 20th IEEE International Conference on Software Maintenance (ICSM’04) 1063-6773/04 $20.00 © 2004 IEEE