Efficient Dynamic Indexing and Retrieval of XML Documents using Three-Dimensional Quasi-BitCube

Biren Shah, Abhilash Gummadi, Jong P. Yoon, Vijay Raghavan
University of Louisiana at Lafayette, P. O. Box 44330, CACS, Lafayette, LA 70504, USA
{bshah, axg1814, jyoon, raghavan}@cacs.louisiana.edu

ABSTRACT

XML is a new standard for exchanging and representing data on the Internet. Techniques for indexing and retrieval of XML data are drawing increasing attention since they enable one to access certain parts of retrieved documents easily. However, they provide little or no support for adding new documents to an existing document collection, requiring instead that the entire collection be re-indexed. Modern applications based on XML indexing and retrieval operate in dynamic environments that require frequent additions to document collections. An indexing structure known as the BitCube has been proposed to perform fast query processing on XML documents. One of the major disadvantages of the BitCube is its inefficient memory management. In this paper, we propose an extended BitCube, called the Quasi-BitCube, which manages memory much more effectively while maintaining the same query processing efficiency as the BitCube. Our work also aims at enabling dynamic (or incremental) indexing of new documents into an existing Quasi-BitCube, without requiring the entire collection to be re-indexed. We have performed an extensive set of experiments to test the effectiveness of both the Quasi-BitCube index structure and the proposed dynamic algorithm for creating it. The results show that the Quasi-BitCube manages memory much more efficiently than the BitCube without compromising query processing time. The experiments on our dynamic indexing algorithm show that it provides better update and search costs than earlier schemes, such as the one used in XQEngine, with acceptable space overheads.

1. INTRODUCTION

A majority of traditional business applications, transactional systems and enterprise applications rely on relational databases to maintain their data. As portals, knowledge management systems and even e-mail have joined the mainstream and become indispensable daily tools, a typical organization's enterprise information is no longer maintained as structured data alone. Structured data typically have a repeated structure that can be stored easily in the tables of a relational database. Semi-structured databases [1], unlike traditional databases, do not have a fixed schema known in advance. Broadly speaking, semi-structured data is self-describing and can model heterogeneity more naturally than either relational or object-oriented database systems.

The eXtensible Markup Language (XML) [3] is a commonly used data modeling technique for such data, and the application of common XML tools blurs the distinction in handling structured and unstructured data. XML is a simplified subset of the Standard Generalized Markup Language (SGML). It provides a file format for representing data, a schema for describing data structure, and a mechanism for extending and annotating Hyper-Text Markup Language (HTML) with semantic information. The XML data model carries both data and schema information, which makes it naturally suitable for representing semi-structured data. It is a standard for representing and exchanging information on the Internet.

Because XML is still an evolving data representation format, database developers and users are not yet adequately familiar with it. As more and more data are represented in XML format, more tools for maintaining XML data are being developed. As XML has become a part of critical databases, the performance of such tools has become a matter of concern. Research on indexing XML databases is being actively pursued and is delivering efficient and effective algorithms.
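As a small illustration of what "self-describing" and "heterogeneous" mean for semi-structured data, the following sketch parses a tiny XML fragment whose two records do not share the same set of fields. The catalogue content, the element names and the helper `describe` are hypothetical and chosen only for illustration; they do not come from the data sets used in this paper. The sketch uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue: the two <book> records carry different fields,
# which a fixed relational schema could not express without NULL columns.
CATALOGUE = """
<catalogue>
  <book>
    <title>XML Indexing</title>
    <author>A. Writer</author>
    <year>2004</year>
  </book>
  <book>
    <title>Semi-structured Data</title>
    <publisher>Example Press</publisher>
  </book>
</catalogue>
""".strip()


def describe(xml_text):
    """Return one dict per record, mapping element names to their text."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in record} for record in root]


records = describe(CATALOGUE)
for record in records:
    # The field names are recovered from the document itself, not from a schema.
    print(sorted(record))
```

Each record carries its own field names in its tags, so a consumer can discover the structure while reading the data instead of relying on a schema fixed in advance.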
The representation of documents in XML paved the way for content-based retrieval. The widespread use of XML in digital libraries, product catalogues, scientific data repositories and across the web has prompted the development of appropriate searching and browsing methods for XML documents. As enterprise applications (or web services) continue to build upon XML, it is critical that they include search functionality that is fully compatible with XML. In order to optimize query processing, the data need to be organized (indexed) in a way that facilitates efficient retrieval. Without indexes, the database may be forced to conduct a full data scan to locate the desired data record, which can be a lengthy and inefficient process. Additionally, modern applications operate in dynamic environments that require frequent additions to document collections. There is an urgent need for an XML indexing and retrieval technique that not only supports dynamic indexing of new documents but also aids efficient query processing.

The rest of the paper is organized as follows. Section 2 describes related work in this area. Section 3 introduces some preliminary operations used elsewhere in the paper. The proposed indexing approach is described in Section 4. Section 5 describes the dynamic indexing algorithm for our index structure. In Section 6, we provide experimental results to assess the properties of indexing and dynamic indexing based on our approach, and we compare it with previous approaches. In Section 7, we summarize the results of our study, draw conclusions and identify future work.

2. RELATED WORK

Among the types of indexes supported or under exploration by commercial database vendors are B+ trees [15], hash indexes [15], signature files [6], inverted files [21], latent semantic indexing [14] and R-trees [8]. These indexing techniques can be evaluated based on access, insertion and deletion time and on the disk space needed.
Each indexing technique differs in its implementation and target use, and each offers the potential to improve query performance for different applications. Although XML can support both structure and content-based information retrieval, efficient indexing is an important