Hierarchical Graph Indexing

James Abello
DIMACS Center, Rutgers University
Piscataway, NJ
abello@dimacs.rutgers.edu, abelloj@optonline.net

Yannis Kotidis
AT&T Labs-Research
Florham Park, NJ
kotidis@research.att.com

ABSTRACT

Traffic analysis, in the context of Telecommunications or Internet and Web data, is crucial for large network operations. Data in such networks is often provided as large graphs with hundreds of millions of vertices and edges. We propose efficient techniques for managing such graphs at the storage level in order to facilitate their processing at the interface level (visualization). The methods are based on a hierarchical decomposition of the graph edge set that is inherited from a hierarchical decomposition of the vertex set. Real-time navigation is provided by an efficient two-level indexing schema called the gkd*-tree. The first level, the gkd-tree, is a variation of a kd-tree index that partitions the edge set in a way that conforms to both the hierarchical decomposition and the data distribution. The second level is a redundant R*-tree that indexes the leaf pages of the gkd-tree. We provide computational results that illustrate the superiority of the gkd*-tree over conventional indexes like the kd-tree and the R*-tree in both creation and query response times.

Categories and Subject Descriptors: H.3.m INFORMATION STORAGE AND RETRIEVAL: Miscellaneous.

General Terms: Algorithms, Management, Design.

Keywords: Graph, Navigation, Visualization, Index.

1. INTRODUCTION

Telecommunications traffic [2], the World-Wide Web [13] and Internet data [16] are typical sources of graphs with sizes ranging from 1 million to several billion edges. These graphs are not only too large to fit on the screen; in general they are also too large to fit in main memory. Screen size and RAM size are therefore the two main bottlenecks that we must address in order to achieve reasonable processing and navigation.
Recently, several mechanisms have been proposed to deal with both bottlenecks in a unified manner. They are based on the notions of Graph Macro-Views and Graph Sketches [1]. These approaches exploit the fact that the multi-graphs mentioned above are sparse, of low diameter and obey a power law distribution that is scale invariant [16].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM'03, November 3-8, 2003, New Orleans, Louisiana, USA. Copyright 2003 ACM 1-58113-723-0/03/0011 ...$5.00.

We can view a weighted multi-digraph as a real non-negative matrix A whose entries are normalized in a suitable fashion. Thus each matrix entry A(i, j) represents some weighted function of the number of edges between vertices i and j. As an example, think of A as representing US phone calls, with one row and one column per phone number. The hierarchical grouping of these numbers into blocks, neighborhoods, towns, counties, states and US regions can be represented as a rooted tree T. This geography-based hierarchy can be used in turn to obtain "aggregate" views of the phone traffic at different "levels" of granularity, i.e. traffic between states, counties, cities, etc.

Navigation from one level of the edge hierarchy to the next is provided by refinement or partial aggregation of the current view. For example, in Figure 1 (figure from [4]), a height field is used to represent the aggregate US states traffic matrix. When a particular entry is selected (such as NJ-NJ), another height field representing the calling traffic between the NJ towns is brought onto the screen.
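The aggregation and refinement steps described above can be illustrated with a minimal Python sketch. It is not the method of this paper, only an in-memory illustration under stated assumptions: the toy hierarchy `PARENT`, the level names in `LEVEL`, the call list `CALLS`, and the helper names `path_from_root`, `ancestor_at` and `aggregate` are all hypothetical, and every leaf of the tree is assumed to sit at the same depth.

```python
from collections import defaultdict

# Hypothetical toy hierarchy T: each node maps to its parent; "US" is the root.
# Leaves are (fictional) phone numbers, i.e. the graph vertices.
PARENT = {
    "973-555-0001": "Newark",
    "973-555-0002": "Newark",
    "609-555-0003": "Trenton",
    "212-555-0004": "Manhattan",
    "Newark": "NJ",
    "Trenton": "NJ",
    "Manhattan": "NY",
    "NJ": "US",
    "NY": "US",
}

# Depth of each named level, counted from the root (assumes uniform leaf depth).
LEVEL = {"state": 1, "town": 2, "number": 3}

# Hypothetical weighted edges (caller, callee, number of calls).
CALLS = [
    ("973-555-0001", "973-555-0002", 3),  # Newark -> Newark
    ("973-555-0001", "609-555-0003", 2),  # Newark -> Trenton
    ("609-555-0003", "973-555-0002", 1),  # Trenton -> Newark
    ("973-555-0002", "212-555-0004", 4),  # Newark -> Manhattan
    ("212-555-0004", "609-555-0003", 5),  # Manhattan -> Trenton
]


def path_from_root(node):
    """Return the list of nodes from the root of T down to `node`."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def ancestor_at(leaf, level):
    """Return the ancestor of `leaf` at the named level of T."""
    return path_from_root(leaf)[LEVEL[level]]


def aggregate(edges, level, within=None):
    """Sum edge weights between ancestors at `level`.

    If `within` is a pair (source_group, target_group), only edges going from
    a leaf below source_group to a leaf below target_group are counted. This
    is the refinement of one selected entry of a coarser view.
    """
    view = defaultdict(int)
    for src, dst, weight in edges:
        if within is not None:
            src_group, dst_group = within
            if src_group not in path_from_root(src):
                continue
            if dst_group not in path_from_root(dst):
                continue
        view[(ancestor_at(src, level), ancestor_at(dst, level))] += weight
    return dict(view)


state_view = aggregate(CALLS, "state")
nj_towns = aggregate(CALLS, "town", within=("NJ", "NJ"))
```

Here `state_view` plays the role of the state-level traffic matrix, and `nj_towns` is the finer view obtained by selecting its NJ-NJ entry; the weights in `nj_towns` add up to that entry. The sketch scans every edge on each request, which is only workable for data that fits in memory.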
Other queries of interest involve computing traffic among entities at different levels of the hierarchy, for example, traffic from a town to a region of the US.

All the traffic queries described above can be modeled as virtual weighted directed edges between tree vertices that are not descendants of each other. A maximal collection of these vertices corresponds to a partition of all the phone numbers. Each such partition together with all its virtual edges represents a Macro-View of the input graph. Each virtual edge represents the subgraph consisting of all the edges going from one set of the partition into another. Each such subgraph is what we call a subgraph slice. Subgraph slices are the detailed views of the aggregated information recorded by the higher-level virtual edges.

Graph Sketches were first introduced in [1] and are incorporated into a system called MGV [2]. The major question not addressed in previous research is how to obtain fast data access when the input arrives as a graph stream. In such cases the entire neighborhood of a vertex is not known a priori, and sorting the entire data set is not an available option. We further assume the existence of a rooted tree T whose set of leaves corresponds to the graph vertex set. When such a tree T is not known a priori, several approaches for its computation have been proposed in [3].

Computing subgraph slices on demand over a graph stream requires fast access to the underlying data (in our example, telephone calls) at different levels of granularity. This is precisely our goal in designing an index, the gkd-tree, that fa-