DISG: A DIStributed Graph Repository for Web Infrastructure

Aoying Zhou
Institute of Massive Computing, East China Normal University
ayzhou@sei.ecnu.edu.cn

Weining Qian
Institute of Massive Computing, East China Normal University
wnqian@sei.ecnu.edu.cn

Dao Tao
Department of Computer Science and Engineering, Fudan University
taodao@fudan.edu.cn

Qiang Ma
Department of Computer Science and Engineering, Fudan University
maqiang@fudan.edu.cn

Abstract

The storage and utilization of Web data have become a major challenge for the data management community. Although many commercial and academic tools have emerged, the structure, content, and user behavior of the Chinese Web have not been fully studied. We are building a Chinese Web Infrastructure (CWI) to support such research. This paper introduces the graph data model used in CWI and then presents the design of a distributed system for managing data conforming to the model.

1. Introduction

The World-Wide Web (WWW, or the Web) contains a huge volume of data published by different organizations and people, and can be treated as the largest distributed database in the world. However, utilizing data on the Web is difficult because the data is unstructured, heterogeneous, and fast changing. We propose to develop a Chinese Web Infrastructure (CWI) to serve the requirements of analyzing data from the Chinese Web [11]. The architecture of CWI is shown in Figure 1.

Two major challenges in developing CWI are 1) the massive dataset to be handled, and 2) the lack of a common schema. We propose a tagged and labeled graph model (TLGM) for representing data on the Web, and a large-scale distributed repository for storing and managing data conforming to the model, so that high efficiency and availability can be achieved.

This paper introduces the design of DisG, a repository for managing data conforming to the TLGM.
Two characteristics distinguish DisG from other systems with similar goals:

1) DisG supports queries on both the content (via keywords) and the structure of the data. It is thus suitable for Web data management, in which data is unstructured or semi-structured and information retrieval is important.

2) DisG is a distributed repository. Both storage and data retrieval are performed in parallel, so it is scalable in terms of both the volume of data and the number of users.

The rest of this paper is organized as follows. Section 2 briefly introduces the preliminaries of the TLGM data model and its query language. Section 3 introduces the environment upon which DisG is developed. Section 4 introduces two major components of the system, namely the storage scheme for graph data and graph reconstruction. Finally, Section 5 discusses our future work on DisG.

2. A graph data model for Web infrastructure

2.1 The TLGM data model

The tagged and labeled graph model (TLGM) is used in CWI [11]. Each data object in TLGM is a quadruple <K, L, P, T>. Here, K is an ordered set of terms <k_i>, L is a set of tags {t_i}, P is the set of pointers to other data objects <l_i, p_i> (l_i is the label and p_i is the reference to a data object), and T is the timestamp. A term may be a keyword or a URL link.

This data model is essentially a directed graph. Each vertex is a data object, while the edges are the pointers in P. Each