Reification of Foreign Type Systems

Mark Grechanik, Don Batory, and Dewayne E. Perry
UT Center for Advanced Research In Software Engineering (UT ARISE)
University of Texas at Austin
Austin, Texas 78712
{gmark, batory}@cs.utexas.edu, perry@ece.utexas.edu

Abstract. Building systems from existing applications and data sources is common practice. Semi-structured data sources, such as XML, HTML, and databases, and programming languages, such as C# and Java, conform to well-defined, albeit different, type systems, each with their own unique underlying representations. As a consequence, writing programs that access and update data in foreign type systems (FTSs), i.e., type systems that are different from the host programming language, is a notoriously difficult task.

In this paper, we present a simple, practical, and effective way to develop and maintain FTS-based systems. We accomplish this by abstracting foreign data as graphs and using path expressions for traversing and accessing data. Path expressions are implemented by type reification: turning foreign types into first-class objects and enabling access to and manipulation of their instances. Doing this results in multiple benefits, including coding simplicity and uniformity (neither of which was present before), that have been demonstrated in a complex commercial project.

The contribution of this paper is an approach that allows programmers to operate on foreign types and their instances without writing or generating additional code. We know of no other approach with comparable benefits.

1 Introduction

Building software systems from existing applications is a well-accepted practice. Applications are often written in different languages and provide data in different formats. An example is a C++ application that parses an HTML-based web page, extracts data, and passes the data into a relational database.
A fundamental problem of engineering these systems is how to operate on different formats and types without introducing unnecessary complexity. Different data formats and languages have different type systems. A markup language type (for example, an HTML tag) does not have a direct counterpart in C++, and a C++ class has no explicit counterpart as an HTML tag. Sometimes the keywords that describe types in grammars are the same; for instance, the keyword int describes the integer types in both C++ and Java, but their internal representations can differ. In this respect every language type system is unique.

As a consequence, systems that manipulate data in foreign type systems (FTSs) are very difficult to develop. Currently, programmers must map foreign types and their instances to types and their instances in a host programming language. Complicating this development is the sheer multiplicity of built-in and user-defined types required to do this mapping.

Recall the C++ program that parses HTML-based data and updates a database with the information retrieved. Even though this task sounds trivial, in reality it is complex and is composed roughly of the following steps:

• Locate and analyze the HTML-based data;
• Map these types onto C++-specific types;
• Analyze the database schema and determine the mapping between the C++ types and database entities;
• Write functions that parse HTML documents, retrieve HTML data, and convert this data to its C++ counterparts;
• Write functions that convert C++ objects into SQL queries that are executed against the database.

This approach suffers from multiple drawbacks. First, it leads to software that is extremely difficult to extend and scale. Since C++ types are different from HTML types, programmers must go through a complex process of parsing and mapping HTML data to its counterparts in C++. Moreover, a programmer must create a set of functions whose semantics reflects the operations on these types.
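A minimal C++ sketch of such hand-written functions, for the HTML-to-database example above, might look as follows. The InventoryRow type, the inventory table, and all function names are illustrative assumptions and are not taken from the system discussed in this paper; the parser is deliberately naive and assumes well-formed, attribute-free, lowercase <td> tags.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical host-side type that mirrors one HTML table row
// (the "map these types onto C++-specific types" step).
struct InventoryRow {
    std::string name;
    int quantity;
};

// Hand-written extraction of the text inside each <td>...</td> pair.
// Assumption: tags are lowercase, carry no attributes, and are not nested.
std::vector<std::string> extractCells(const std::string& html) {
    const std::string open = "<td>";
    const std::string close = "</td>";
    std::vector<std::string> cells;
    std::size_t pos = 0;
    while ((pos = html.find(open, pos)) != std::string::npos) {
        const std::size_t start = pos + open.size();
        const std::size_t end = html.find(close, start);
        if (end == std::string::npos) break;  // unterminated cell: stop scanning
        cells.push_back(html.substr(start, end - start));
        pos = end + close.size();
    }
    return cells;
}

// Convert the HTML cell text to its C++ counterpart.
// A sketch only: production code would report malformed input
// through an error value or exception rather than an assertion.
InventoryRow toInventoryRow(const std::vector<std::string>& cells) {
    assert(cells.size() == 2 && "expected a name cell and a quantity cell");
    return InventoryRow{cells[0], std::stoi(cells[1])};
}

// Convert the C++ object into SQL text for a hypothetical table
// inventory(name, quantity). Single quotes are doubled so the string
// literal stays well-formed; real code would prefer bound parameters.
std::string toInsertSql(const InventoryRow& row) {
    std::string escaped;
    for (char c : row.name) {
        escaped += c;
        if (c == '\'') escaped += '\'';
    }
    return "INSERT INTO inventory (name, quantity) VALUES ('" + escaped +
           "', " + std::to_string(row.quantity) + ");";
}
```

Calling toInsertSql(toInventoryRow(extractCells(html))) chains the three mappings. Even in this tiny case, every additional HTML element or database column would require another struct field, another conversion, and another query fragment.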
With a growing number of types and functions, program complexity becomes increasingly unmanageable and thus limits scalability and extensibility.

Second, this approach often leads to non-uniformity in FTS-based code. Since programmers are constrained neither in their mappings from foreign types to host types nor in their operations on mapped types, the resulting code for different type systems looks different, even when it is written