XMorph: A Shape-Polymorphic, Domain-Specific XML Data Transformation Language

Curtis Dyreson #1, Sourav Bhowmick *2, Aswani Rao Jannu #3, Kirankanth Mallampalli #4, Shuohao Zhang ^5

# Department of Computer Science, Utah State University, Logan, UT USA
1 Curtis.Dyreson@usu.edu
3 aswani.jannu@usu.edu
4 kirankanth.mallampalli@usu.edu

* Nanyang Technological University, Singapore
2 assourav@ntu.edu.sg

^ Marvel, San Jose, CA USA
5 shuohao@msn.com

Abstract— By imposing a single hierarchy on data, XML makes queries brittle in the sense that a query might fail to produce the desired result if it is executed on the same data organized in a different hierarchy, or if the hierarchy evolves during the lifetime of an application. This paper presents a new transformation language, called XMorph, which supports more flexible querying. XMorph is a shape polymorphic language, that is, a single XMorph query can extract and transform data from differently-shaped hierarchies. The XMorph data shredder distills XML data into a graph of closest relationships, which are exploited by the query evaluation engine to produce a result in the shape specified by an XMorph query.

I. INTRODUCTION

The goal of the research presented in this paper is to make it easier for users to query data, in particular XML data. One factor that adds complexity to querying XML data is that query writers have to know the shape of the data to effectively query it. Long before the advent of XML, E. F. Codd wrote about this problem. In his foundational paper on the relational model, Codd critiqued the hierarchical model, in part, because it uses asymmetric path expressions to locate data [4]. A path expression is a specification of a path in a hierarchy. Codd presented five hierarchies for a simple part/supplier database and demonstrated that, in general, a path expression formulated with respect to one hierarchy would fail on some other.
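The brittleness of path expressions can be sketched in a few lines of Python using the standard xml.etree.ElementTree module. This is only an illustration under stated assumptions: the two documents, the element and attribute names, and the helper count_matches are hypothetical and do not come from Codd's paper or from the XMorph implementation. The same path expression is evaluated against the same facts arranged in two hierarchies.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the same supplier/part facts in two hierarchies.
SUPPLIERS_ABOVE = """
<db>
  <supplier name="Acme">
    <part name="bolt"/>
    <part name="nut"/>
  </supplier>
</db>
""".strip()

PARTS_ABOVE = """
<db>
  <part name="bolt"><supplier name="Acme"/></part>
  <part name="nut"><supplier name="Acme"/></part>
</db>
""".strip()


def count_matches(xml_text, path):
    """Return how many elements the path expression locates below the root."""
    root = ET.fromstring(xml_text)
    return len(root.findall(path))


# A path expression written for the suppliers-above-parts hierarchy.
PATH = "supplier/part"

matches_suppliers_above = count_matches(SUPPLIERS_ABOVE, PATH)
matches_parts_above = count_matches(PARTS_ABOVE, PATH)

print("suppliers above parts:", matches_suppliers_above)
print("parts above suppliers:", matches_parts_above)
```

In this sketch the expression locates both parts in the first document and nothing in the second, even though the two documents record the same supplier/part relationships.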
For instance, suppose that the expression supplier/part locates parts “below” suppliers. The same expression fails when the data is organized differently, say when parts are above suppliers. Asymmetric path expressions have resurfaced in XML query languages.

In this paper we propose a new, shape-polymorphic, domain-specific data transformation language called XMorph. We invite readers to visit the XMorph project website 1 to experiment with XMorph in an on-line demo or download the Java implementation. XMorph offers the following features in a data transformation language.

1 http://www.cs.usu.edu/~cdyreson/pub/XMorph

Easy to specify and transform the data’s shape. The primary component of XMorph is a morph in which the user declares the desired shape of the result. XMorph reorganizes the source data to match the specified shape.

Shape polymorphism. In XMorph, only the shape of the output needs to be given; the query adapts to the shape of the input. Shape polymorphism was first described by Jay and Crockett [7]. In object-oriented languages, shape polymorphism means that a method, e.g., one that prints a value, adapts to the shape of the data, e.g., to a tree or a list. This notion applies to database query languages as follows: a language is shape polymorphic if a query evaluated on the same data in different structures yields (approximately) the same result 2.

Ability to identify information loss. The XMorph query engine can analyze a query to determine potential information loss in a transformation.

XQuery support. XMorph can be translated to XQuery.

Ability to treat attributes as indistinct from subelements. Though data modelers often arbitrarily choose to use attributes rather than subelements, XMorph queries do not force users to differentiate between them.

Easy creation of groups. XQuery 1.0 has ad-hoc support for groups using a distinct-values function. XQuery 1.1 adds support for grouping in aggregation.
XMorph supports both persistent and dynamic group creation for data transformation.

Vocabulary translation. To use XMorph, a user has to know the “vocabulary” (e.g., the names of the elements) in a data collection. But XMorph also supports vocabulary translation, so that users can change their terminology.

Finally, XMorph is a domain-specific 3 language, lacking many features found in a general-purpose query language like XQuery, such as namespace and white-space handling.

2 The same result modulo duplicates, ordering, and attribute/subelement swaps.
3 Domain-specific has nothing to do with a database “domain,” rather it means “special purpose” or dedicated to a specific task.

XMorph cannot guarantee that document order is maintained by a transformation (due to grouping, though without