BitCube: Clustering and Statistical Analysis for XML Documents

Jong Yoon
Ctr. for Advanced Computer Studies
University of Louisiana
Lafayette, LA 70504-4330
(337) 482-6765
jyoon@cacs.usl.edu

Vijay Raghavan
Ctr. for Advanced Computer Studies
University of Louisiana
Lafayette, LA 70504-4330
(337) 482-6603
raghavan@cacs.usl.edu

Larry Kerschberg
Dept. of Information & Software Eng.
George Mason University
Fairfax, VA 23010-4444
(703) 993-1661
kersch@gmu.edu

ABSTRACT
In this paper, we describe a new bitmap indexing technique for clustering XML documents. XML is a new standard for exchanging and representing information on the Internet. Documents can be represented hierarchically by XML-elements. We represent and index XML documents using a bitmap indexing technique. We define the similarity and popularity operations available on bitmap indexes and propose a method for partitioning an XML document set. Furthermore, the 2-dimensional bitmap index is extended to a 3-dimensional bitmap index, called BitCube. We define statistical measurements on the BitCube: mean, mode, standard deviation, and correlation coefficient. Based on these measurements, we also define the slice, project, and dice operations on a BitCube. A BitCube can be manipulated efficiently and improves the performance of document retrieval.

Keywords
Document clustering, bitmap indexing, bit-wise operations.

1. INTRODUCTION
EXtensible Markup Language (XML) is a standard for representing and exchanging information on the Internet. Because documents can be represented in XML, content-based retrieval becomes possible. However, because XML documents are very large and their types vary, typical information retrieval techniques such as LSI (Latent Semantic Indexing) [7] are not satisfactory. Information retrieval on the Web is also unsatisfactory, due partly to the poor quality of retrieved results [5].

We consider a document database (D). Each document (d) is represented in XML.
So, d contains XML-elements (p), where p has zero or more terms (w) bound to it. Typical indexing requires a frequency table, a two-dimensional matrix indicating the number of occurrences of each term in each document. Similarly, we use a three-dimensional matrix indexed by (d, p, w). We also treat a pair (p, w) as a query: given a pair (p, w), we want to find every d in the document database for which the triplet (d, p, w) exists. In many cases on the Internet, answering such a query is too slow. A simple way to speed up query answering is to speed up the distance calculations by using well-organized document clusters. In this paper, we propose a bitmap indexing technique, which we call “BitCube,” that represents (d, p, w), together with operations that can cluster such documents efficiently. Before going further, consider the following examples.

1.1 Motivating Examples
EXAMPLE 1: Suppose that a query Q1 is posed to find all documents that describe “XML” in any figure caption of a subsection. This type of query cannot easily be processed in relational or object-oriented document databases, because they model the irregularity of documents inflexibly and their performance is unacceptable. In XML, however, the irregularity of elements can be represented flexibly, as shown in Figure 1. To increase performance, bitmap indexing of XML documents will be used.

EXAMPLE 2: Suppose that a query Q2 is posed to find all documents that describe “XML” in more than one sub-subsection. Notice that this type of query asks for a specific document structure: not sections or subsections, but sub-subsections. Searching an entire XML database is costly because the XML documents in the database are not represented regularly. To resolve this irregularity, the bitmap indexes generated in EXAMPLE 1 will be clustered. In this way, searching within a single cluster, rather than the entire database, may improve performance.
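The (d, p, w) representation and the (p, w) query described in the introduction can be illustrated with a minimal sketch. It is not the BitCube implementation defined later in the paper: the class name `BitCubeSketch`, the slash-separated element paths, and the document ids are assumptions made for illustration, and each (p, w) pair simply keeps one Python integer as a bitmap over documents.

```python
from collections import defaultdict


class BitCubeSketch:
    """Toy presence index over (d, p, w) triplets (hypothetical, for illustration).

    For every (path, word) pair it keeps one integer whose i-th bit is set
    when the i-th document contains that word under that element path.
    """

    def __init__(self):
        self.doc_ids = []                # bit position -> document id
        self.bitmaps = defaultdict(int)  # (path, word) -> bitmap over documents

    def add_document(self, doc_id, path_words):
        """Register a document given the (element_path, word) pairs found in it."""
        bit = 1 << len(self.doc_ids)
        self.doc_ids.append(doc_id)
        for path, word in path_words:
            self.bitmaps[(path, word.lower())] |= bit

    def _decode(self, bitmap):
        """Turn a bitmap back into the list of matching document ids."""
        return [d for i, d in enumerate(self.doc_ids) if (bitmap >> i) & 1]

    def query(self, path, word):
        """Return every d for which the triplet (d, path, word) exists."""
        return self._decode(self.bitmaps.get((path, word.lower()), 0))

    def query_all(self, pairs):
        """Return documents matching every (path, word) pair, using bit-wise AND."""
        result = (1 << len(self.doc_ids)) - 1  # start with all documents set
        for path, word in pairs:
            result &= self.bitmaps.get((path, word.lower()), 0)
        return self._decode(result)


# Assumed element paths; the real paths depend on the document schema.
CAPTION = "section/subsection/figure/caption"
TITLE = "section/title"

cube = BitCubeSketch()
cube.add_document("d1", [(CAPTION, "XML"), (TITLE, "index")])
cube.add_document("d2", [(TITLE, "XML")])
cube.add_document("d3", [(CAPTION, "xml"), (TITLE, "XML")])

print(cube.query(CAPTION, "XML"))                         # ['d1', 'd3']
print(cube.query_all([(CAPTION, "XML"), (TITLE, "XML")]))  # ['d3']
```

The first call corresponds roughly to query Q1 in EXAMPLE 1. A query such as Q2, which counts occurrences across several sub-subsections, would need more than a presence bit per path and is not covered by this sketch.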
1.2 Related Work
The conventional techniques used in document retrieval systems include stop lists, word stems, and frequency tables. Words that are deemed “irrelevant” to a query are eliminated from the search. Words that share a common word stem are replaced by the stem. A frequency table is a matrix that indicates the occurrences of words in documents. An occurrence here can be simply the frequency of a word or the ratio of the word frequency to the size of the document. However, the size of the frequency table increases dramatically as the size of the document database increases.

To reduce frequency tables, the latent semantic indexing (LSI) technique has been developed [7]. LSI retains only the “most significant” part of the frequency table, obtained through a singular value decomposition (SVD). Although the SVD reduces the size of the original frequency table, computing the decomposition is not trivial. Instead, this paper considers a more complex frequency table that represents terms (or values) for an XML-element path used in an XML document. We describe a novel approach to decomposing a frequency table when the table is a sparse matrix.

In addition, a new data structure, called the X-tree, has been introduced for storing very high-dimensional data [1]. Inverted indexes have been studied extensively [8]. Fast insertion algorithms for inverted indexes have been proposed [9].
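As a rough illustration of the conventional frequency table discussed in this subsection, the following sketch builds one from plain text, either as raw counts or as ratios to document size. The stop list, the whitespace tokenizer, and the function name `frequency_table` are assumptions made for the example; stemming is omitted, and document size is taken here to be the number of tokens before stop-word removal.

```python
from collections import Counter

# Illustrative stop list (an assumption; real systems use much longer lists).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and"}


def frequency_table(docs, relative=False):
    """Build a document-by-word frequency table.

    docs: dict mapping doc_id -> text.
    relative: if True, store count / document size instead of the raw count.
    Returns a dict mapping doc_id -> {word: value}; absent words are implicit zeros.
    """
    table = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        counts = Counter(t for t in tokens if t not in STOP_WORDS)
        if relative:
            size = len(tokens) or 1  # avoid division by zero for empty documents
            table[doc_id] = {w: c / size for w, c in counts.items()}
        else:
            table[doc_id] = dict(counts)
    return table


docs = {
    "d1": "the index of the XML index",
    "d2": "clustering of XML documents",
}
raw = frequency_table(docs)
ratio = frequency_table(docs, relative=True)

print(raw["d1"])    # {'index': 2, 'xml': 1}
print(ratio["d1"])  # index -> 2/6, xml -> 1/6
```

Storing only the non-zero entries per document, as done here, is one common way to cope with the sparsity that the text attributes to large frequency tables; it is not the decomposition approach proposed in this paper.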