Towards A Portable XML-based Source Code Representation

Ying Zou and Kostas Kontogiannis
Dept. of Electrical & Computer Engineering
University of Waterloo
Waterloo, ON, N2L 3G1, Canada
{yzou, kostas}@swen.uwaterloo.ca

Abstract

Program representation is a critical issue in the area of software analysis and software re-engineering. It directly affects the portability and effectiveness of the software analysis tools that can be developed. This paper describes an approach that focuses on source code representation schemes in the form of Abstract Syntax Trees that are encoded as XML documents. These XML source code representations conform to a DTD we call the domain model. By utilizing a domain model for a given programming language, we can build tools on top of an XML DOM tree. The XML DOM tree has a standard API that all tools developed using this approach can use. In this way, software analysis tools can interoperate or be easily integrated in what we refer to as Integrated Software Maintenance Environments.

1. Introduction

One of the greatest challenges for software analysis and software re-engineering is to design and implement parsers in order to access the intermediate representation of the source code. Even for a simple programming language, the effort to develop a parser can be very high [3, 6].

As an alternative approach, software analysis tools can be built on top of existing CASE environments that keep representations of the source code of the system being analyzed in a proprietary information base format, and offer an API to access these internal source code representations. Such environments include Refine [1], Datrix, and IBM VisualAge for Java and C++ CodeStore [3]. However, any tool built on such an information base lacks portability and interoperability with other software analysis tools.
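
For orientation, the idea summarized in the abstract (an Abstract Syntax Tree encoded as XML and accessed through the standard DOM API) can be sketched as follows. This is a minimal illustration under stated assumptions, not the domain model developed in this paper: the element and attribute names (File, Function, Call, callee, and so on) and the helper called_functions are hypothetical, and Python's xml.dom.minidom stands in for any W3C DOM implementation.

```python
from xml.dom.minidom import parseString

# Hypothetical XML encoding of the AST for the C fragment:
#     int main() { init(); return run(1); }
# Element and attribute names are illustrative assumptions,
# not the DTD (domain model) defined in the paper.
AST_XML = """
<File name="main.c">
  <Function name="main" returnType="int">
    <Block>
      <ExpressionStatement>
        <Call callee="init"/>
      </ExpressionStatement>
      <Return>
        <Call callee="run">
          <IntegerLiteral value="1"/>
        </Call>
      </Return>
    </Block>
  </Function>
</File>
"""


def called_functions(xml_text):
    """Return (caller, callee) pairs using only standard DOM calls."""
    document = parseString(xml_text.strip())
    pairs = []
    # getElementsByTagName visits descendants in document order.
    for function in document.getElementsByTagName("Function"):
        caller = function.getAttribute("name")
        for call in function.getElementsByTagName("Call"):
            pairs.append((caller, call.getAttribute("callee")))
    return pairs


if __name__ == "__main__":
    for caller, callee in called_functions(AST_XML):
        print(f"{caller} -> {callee}")
```

The analysis function depends only on the document structure and on DOM methods, so under these assumptions any other tool that reads the same XML document could reuse or replace it without access to a proprietary information base.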
As a third alternative approach, we present a source code representation framework that is based on a domain model and a Document Type Definition (DTD), and that provides a standard API on which interoperable and portable software analysis tools can be built. In a nutshell, the process of software reverse engineering focuses on the decomposition of a software system into objects and relationships that are stored in an information base, and on the creation and transformation of various program views [4]. These views can be generated from the proposed XML representation of the Abstract Syntax Tree. In this way, software analysis and re-engineering tools can be developed on top of XML-based Abstract Syntax Trees instead of proprietary formats. These tools will be interoperable because they all access the source code representation through the same API, the W3C DOM API. In this paper, we present this approach and discuss its advantages and limitations.

The rest of the paper is organized as follows. Section 2 provides a brief introduction to the concepts pertaining to Abstract Syntax Trees. Section 3 presents the approach adopted for modeling source code in terms of XML documents and XML DOM trees. Section 4 presents a case study on developing domain models for the C programming language. Related work is explored in Section 5. Finally, Section 6 provides pointers for on-going and future work.

2. Abstract syntax tree and XML

Abstract syntax trees

Source code can be represented at a spectrum of levels of granularity. At the finest level of granularity, Abstract Syntax Trees (ASTs) are used. These contain information about the source program [6] in the form of nodes and edges [2]. Such tree-like structures represent the source program in a top-down manner. For example, C applications are represented at the top level as applications, modules, and files, and at the lowest levels as functions, declarations, macros,