Extracting Community Structure Features for Hypertext Classification

Dell Zhang
SCSIS
Birkbeck, University of London
London WC1E 7HX, UK
dell.z@ieee.org

Robert Mao
Microsoft Corp.
EPDC5/2352, South County Business Park
Leopardstown, Dublin, Ireland
robmao@microsoft.com

Abstract

Standard text classification techniques assume that all documents are independent and identically distributed (i.i.d.). However, hypertext documents such as web pages are interconnected with links. How to take advantage of such links as extra evidence to enhance automatic classification of hypertext documents is a non-trivial problem. We think a collection of interconnected hypertext documents can be considered as a complex network, and the underlying community structure of such a document network contains valuable clues about the correct classification of documents. This paper introduces a new technique, Modularity Eigenmap, that can effectively extract community structure features from the document network, which is either induced from document link information only or constructed by combining both document content and document link information. A number of experiments on real-world benchmark datasets show that the proposed approach leads to excellent classification performance in comparison with the state-of-the-art methods.

1 Introduction

Automatic text classification or categorization is a fundamental problem in information retrieval and organization [9]. Standard text classification techniques assume that all documents are independent and identically distributed (i.i.d.). However, hypertext documents such as web pages are interconnected with links. Although we can easily employ standard text classification techniques for the problem of hypertext classification, we anticipate that better classification performance could be achieved by exploiting the additional document link information. It is usually not promising to use the raw links directly as features.
Due to the sparsity of the document network, two documents in the same class may appear dissimilar on the surface when they are not directly connected, but such a pair of documents is likely to be connected indirectly by paths through a number of intermediate documents. In other words, it is important to exploit not only local link information but also global structure information for effective hypertext classification.

We think a collection of interconnected hypertext documents can be considered as a complex network [7], and the underlying community structure [8] of such a document network contains valuable clues about the correct classification of documents. A number of graph-based classification methods have emerged in recent years (see Section 2). However, they are not really suitable for complex networks because they do not take the degree distribution of the network into consideration. This paper introduces a new technique, Modularity Eigenmap, that can effectively extract community structure features from the document network for enhanced hypertext classification.

2 Related Work

Some hypertext classification methods rely solely on document link information. Previous studies have shown that using raw links directly as features does not work well in practice [2, 12]. Inspired by the success of PageRank and HITS in Web search and mining, researchers have tried to apply link analysis techniques to hypertext classification. Gyöngyi et al. propose a technique similar to Personalised PageRank for large-scale web page classification [5]. Zhou et al. propose a technique in the semi-supervised learning framework, Directed Graph Regularization (DGR), that combines the ‘hub’ and ‘authority’ information of web pages for their classification [11].

Some hypertext classification methods make use of both document content and document link information.
The simplest method is to expand each document’s feature vector by incorporating the features of its neighbours (directly linked documents), but this method seems to have difficulties with parameter tuning and often does not provide a robust solution [2]. Chakrabarti et al. propose to address the hypertext