Semantic Driven Program Analysis
Andrian Marcus
Department of Computer Science
Wayne State University
Detroit, MI 48202
313 577 5408
amarcus@wayne.edu
Abstract
The paper presents an approach to extract and to
analyze the semantic content (i.e., problem and solution
domain semantics) of existing software systems to
support program understanding and various software
maintenance tasks, such as recovery of traceability
links between documentation and source code,
identification of abstract data types in legacy code, and
identification of high-level concept clones in software.
The semantic information is derived from the
comments, documentation, and identifier names
associated with the source code using information
retrieval methods. The paper advocates for the use of
latent semantic indexing as the underlying support for
the semantic driven analysis.
The presented results are based on the author’s
doctoral dissertation [12].
1 Introduction
The tasks of maintenance and reengineering of an
existing software system require a great deal of effort to
be spent on understanding the source code to determine
the behavior, organization, and architecture of the
software not reflected in documentation. Program
comprehension is one of the most important software
engineering activities. It is vital for learning to
program and for debugging, reuse, documentation,
verification, and maintenance.
There are different types of strategies that a software
engineer adopts during comprehension, based on the
maintenance task at hand. In particular, there are top-down
strategies [3, 16, 20], bottom-up strategies [15, 19], and
integrated strategies [21]. Each of these
strategies results in the development of a mental model
by the software engineer. The mental model is in fact a
representation of various aspects and dimensions of the
software system and relationships among the system
entities. In order to define the mental model, the
software engineer needs to gather information available
from the source code and associated documentation.
Different types of information (e.g., static, dynamic,
source code, documentation, etc.) will describe
different features of the software system. There are at
least two key aspects of the system that the user needs
to understand:
1) what problem does the software solve and
2) how does the software reach the solution.
Static analysis directly supports software
comprehension. Various methods exist to perform
static analysis of the software system. Most of the
existing methods focus on the structural information
embedded in the source code, derived mainly from the
programming language syntax (e.g., control and data
flow). This type of information helps the user
understand how the software works.
Software engineers must examine both the structural
aspect of the source code and the description of the
problem domain (e.g., comments, documentation, and
variable names) to extract the information needed to
fully understand any part of the system. We describe
this type of information as semantic information of the
software system.
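
To make the notion of semantic information concrete, the following minimal sketch collects the words found in comments and the parts of identifier names from a C-like source fragment. It is an illustration only, not the extraction procedure used in this work: the helper names (split_identifier, semantic_terms), the regular expressions, and the splitting rules are all assumptions. Language keywords and string literals are deliberately left unfiltered to keep the example short.

```python
import re

# Illustrative assumptions: these patterns cover only C-style comments
# and simple camelCase / underscore naming conventions.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
WORD_RE = re.compile(r"[A-Za-z]+")
CAMEL_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def split_identifier(name):
    """Split an identifier on underscores and camelCase boundaries."""
    parts = []
    for chunk in name.split("_"):
        parts.extend(p.lower() for p in CAMEL_RE.findall(chunk))
    return parts


def semantic_terms(source):
    """Return comment words followed by identifier parts, in order."""
    comments = COMMENT_RE.findall(source)
    code = COMMENT_RE.sub(" ", source)  # remove comments before scanning code
    terms = []
    for comment in comments:
        terms.extend(w.lower() for w in WORD_RE.findall(comment))
    for ident in IDENT_RE.findall(code):
        # Keywords such as "int" are kept; a real tool would likely filter them.
        terms.extend(split_identifier(ident))
    return terms


if __name__ == "__main__":
    fragment = "// update the balance\nint accountBalance = get_total(acct);"
    print(semantic_terms(fragment))
```

The resulting term list is the kind of input that an information retrieval method could then index. How terms are weighted and compared is a separate step that this sketch does not cover.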
Existing research efforts dealing with semantic
information are focused on applying knowledge based,
algorithmic, and transformational approaches to the
domain of software. Few of these methods scale up
well; they are typically very labor intensive, time
consuming, costly, and often impractical for large-scale
software. Work has been done on the extraction of
semantic information from source code, internal
comments, and file names [1, 2, 5, 7, 18]. An inherent
problem with many of these approaches is that they
construct a specific domain model and vocabulary,
which restricts the scope and flexibility of their
solutions.
With this in mind, we propose an approach that
adopts a less accurate, but much cheaper and more
flexible, method for extracting and analyzing the
semantic information. Specifically, we are proposing
the use of an information retrieval method (i.e., Latent
Semantic Indexing) to extract and analyze the semantic
information from the source code and documentation of
a software system. In addition, the novelty of this
approach is that it allows the combining of existing and
Proceedings of the 20th IEEE International Conference on Software Maintenance (ICSM’04)
1063-6773/04 $20.00 © 2004 IEEE