Using Information Retrieval to Support Design of Incremental Change of Software

Denys Poshyvanyk, Andrian Marcus*
Department of Computer Science
Wayne State University
Detroit, MI 48202
1-313-577-5408
[denys, amarcus]@wayne.edu

* Dissertation advisor

ABSTRACT

The proposed research defines an approach to combine Information Retrieval based analysis of the textual information embedded in software artifacts with program static and dynamic analysis techniques to support key activities of the incremental change of software, such as concept and feature location.

Categories and Subject Descriptors

D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement – enhancement, restructuring, reverse engineering, and reengineering

General Terms

Algorithms, Design, Experimentation, Performance

Keywords

Program understanding, feature identification, concept location, impact analysis, change propagation, dynamic and static analyses, information retrieval, coupling and cohesion measurement

1. PROBLEM DESCRIPTION

During the evolution of large scale software systems most activities involve making changes to the existing source code. Identifying the parts of the source code that correspond to a specific functionality is a prerequisite to program comprehension and is one of the most common activities undertaken by developers. This process is called concept (or feature) location and it is a part of the incremental change of software process [30].

Although incremental change ultimately needs to identify all components to be changed, the programmer must find the location in the code where the first change must be made. For that, the programmer uses a search process where the search space is the whole software and where diverse search techniques narrow down the search space. The literature limits this step to finding a small number of feature components.
The full extent of the change is then handled by impact analysis, which is used to identify the remaining impacted components. In this research proposal, we specifically address the identification of methods in object-oriented software that are part of the implementation of a feature (i.e., they change when the feature is altered) and can be used as a starting point in impact analysis.

While developers often perform feature location manually, tool support is needed for large and complex programs. Existing tools supporting feature location rely on data obtained via static and/or dynamic analysis of the program. Dynamic analyses often cannot discriminate between overlapping features. Static analyses filter and organize data better, but they can rarely identify precisely the elements of source code that contribute to a specific execution scenario. The research community has long recognized the need to combine static and dynamic techniques [11] to improve the effectiveness of feature location [3, 9, 32, 36]. All these techniques are designed to be applied to the source code, yet they do not capture the important textual (or lexical) information embedded in the identifiers and comments present in the source code.

Artifacts generated from the source code, such as call graphs or execution traces, provide in their structure information on how the system works, whereas textual artifacts capture information on what the system does, as well as important knowledge about the software domain, design decisions, developer information, communication, etc. We refer to these two types of information as structural and semantic, respectively. In order to locate features and change a software system, developers must understand both what the system does and how it works; hence, they need to analyze both types of information. While these two types of information are complementary, there is little support for their combination.
In particular, many of the existing tools do not provide an explicit representation of the semantic information, but rather assume the implicit representation embedded in the textual software artifacts.

2. RESEARCH GOALS

We propose the use of Information Retrieval (IR) techniques to extract and represent the semantic information in large-scale software systems such that it can be automatically combined with structural information to better support concept and feature location in source code. Specifically, the research will focus on combining IR-based analysis data with the analysis of program dependencies and execution traces to define new techniques for feature location. We expect that these new techniques will contribute directly to improving the design of incremental change, and thus to increased software quality and reduced software maintenance costs.

Copyright is held by the author/owner(s). ASE’07, November 5–9, 2007, Atlanta, Georgia, USA. ACM 978-1-59593-882-4/07/0011.
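
As a rough illustration of the kind of combination described under the research goals, the following minimal Python sketch ranks methods by the textual similarity between a developer query and each method's identifiers and comments, and optionally keeps only the methods that appear in an execution trace of the feature. Everything in it is an assumption made for illustration: TF-IDF with cosine similarity stands in for whichever IR model the proposed research adopts, and the method names, method texts, query, and trace are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split identifiers and comments into lowercase word tokens."""
    # Insert a space at camelCase boundaries, then keep alphabetic runs only.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", spaced)]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_methods(query, methods, executed=None):
    """Rank methods by textual similarity to the query.

    methods  -- dict mapping a method name to its text (identifiers, comments)
    executed -- optional set of method names observed in an execution trace;
                when given, methods outside the trace are discarded
    Returns a list of (score, method name) pairs, best match first.
    """
    names = list(methods)
    docs = [tokenize(methods[name]) for name in names]
    n = len(docs)
    # Document frequency: in how many methods each term occurs.
    df = Counter(term for doc in docs for term in set(doc))

    def vectorize(tokens):
        # Term frequency times a smoothed inverse document frequency.
        tf = Counter(tokens)
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                for t, c in tf.items()}

    q = vectorize(tokenize(query))
    scored = [(cosine(q, vectorize(doc)), name)
              for name, doc in zip(names, docs)]
    if executed is not None:
        # Structural filter: keep only methods that actually ran.
        scored = [(s, name) for s, name in scored if name in executed]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical corpus: one short text per method.
methods = {
    "Editor.saveFile": "saveFile write buffer contents to file on disk",
    "Editor.openFile": "openFile read file from disk into buffer",
    "Spell.checkWord": "checkWord look up word in dictionary",
    "Log.write": "write message to log file",
}
query = "save file to disk"
# Hypothetical trace collected while exercising the "save" feature.
executed = {"Editor.saveFile", "Log.write"}

for score, name in rank_methods(query, methods, executed):
    print(f"{score:.3f}  {name}")
```

In this sketch the trace acts as a hard filter on the textual ranking. Using the structural information to re-weight the scores instead of discarding methods would be an equally plausible way to combine the two sources.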