A Graphical Environment for Change Detection in Structured Documents*

George J. S. Chang, Girish Patel, Liam Relihan, Jason T. L. Wang†
Department of Computer and Information Science
New Jersey Institute of Technology
University Heights, Newark, NJ 07102

Abstract

Change detection in structured documents (e.g. SGML) is important in many applications, including data warehousing, digital libraries and Internet databases. This paper presents a graphical environment for detecting changes in structured documents. We represent each document by an ordered labeled tree based on the underlying markup language. We then compare two documents by using previously developed algorithms for pattern matching and pattern discovery in trees. Several operators are developed to support the comparison of the documents; graphical devices are provided to facilitate the use of the operators. We believe the proposed tool is useful not only for document management but also for software maintenance, particularly configuration management and version control, where programs are represented as parse trees and detecting changes in the trees provides a way to find the syntactic differences between two program versions.

1 Introduction

A recent trend in document systems technology is to emphasize the logical structures inherent in many kinds of documents. In general, text processing systems and word processing systems require additional information to be recorded about the text of the document being processed. This metainformation is usually interspersed with the text itself and is often referred to as markup. Individual fragments of markup are called tags. One kind of markup, generalized markup, is becoming increasingly common.
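To make the notions of tags and of the ordered labeled tree mentioned in the abstract concrete, the following Python sketch builds such a tree from a small tagged document and then lists what differs between two versions. Everything in it is an assumption made for illustration: the sample memo and its tag names, the `Node`, `TreeBuilder`, `build_tree`, `preorder` and `changed_labels` names, and the use of Python's standard `html.parser` in place of a real SGML parser. The comparison step diffs the preorder label sequences with `difflib`, which is a crude stand-in and not the tree pattern matching algorithms the paper relies on.

```python
import difflib
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    """One node of an ordered labeled tree (hypothetical helper type)."""
    label: str
    children: list = field(default_factory=list)  # child Nodes, in document order


class TreeBuilder(HTMLParser):
    """Builds an ordered labeled tree from a well-nested tagged document."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # path from the root to the currently open element

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumption: input is well nested, so an end tag closes the innermost element.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text content becomes a leaf labeled with the text itself.
            self.stack[-1].children.append(Node(text))


def build_tree(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def preorder(node, depth=0):
    """Yield (depth, label) pairs in document order."""
    yield depth, node.label
    for child in node.children:
        yield from preorder(child, depth + 1)


def changed_labels(old, new):
    """Crude stand-in for tree comparison: diff the preorder label sequences."""
    a = [label for _, label in preorder(old)]
    b = [label for _, label in preorder(new)]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(op, a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]


old = build_tree("<memo><to>Staff</to><para>Meeting at noon.</para></memo>")
new = build_tree("<memo><to>Staff</to><para>Meeting at 1 pm.</para>"
                 "<para>Bring notes.</para></memo>")

for depth, label in preorder(old):
    print("  " * depth + label)  # indentation shows the nesting of tags
print(changed_labels(old, new))
```

Flattening the trees to label sequences discards the parent-child structure, which is exactly the information a tree-based comparison keeps. The sketch therefore only shows the data representation, not the quality of change detection that tree algorithms can achieve.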
Generalized markup is based on two postulates [3, 5]:

• Markup should describe a document's structure and other attributes rather than specifying processing instructions to be carried out on it.
• Markup should be rigorous in order that techniques available for processing other rigorously defined objects (e.g. programs, databases) be available for processing documents also [1].

Generalized markup provides the following advantages over the more usual kind of markup (procedural markup) that merely specifies processing instructions:

• Information is preserved; the identification of logical elements is not lost.
• Arbitrary processing instructions may be assigned to tags. This provides:
  - flexibility: the appearance of whole sets of documents may be changed instantly by simply changing the processing instructions associated with the tags;
  - portability: since platform-dependent processing instructions are not embedded in descriptively marked-up documents, the portability of documents is enhanced.
• It is feasible to make "intelligent" queries on documents.

SGML is a metasyntax used for writing generalized markup syntaxes. Ultimately, the rationale behind SGML is to provide mechanisms that allow documents to be described in such a way that they are easily portable across systems. In fact, SGML is becoming the de facto standard for structured document creation and exchange.

This paper presents a graphical environment for change detection in structured documents such as those written in SGML and in HTML, an application of SGML. SGML and HTML are widely used to define document types for

* Work partially supported by NSF grants IN-9224602 and IRI-9531548.
† Contact author; email: jason@village.njit.edu

0730-3157/97 $10.00 © 1997 IEEE