Extracting Community Structure Features for Hypertext Classification

Dell Zhang
SCSIS
Birkbeck, University of London
London WC1E 7HX, UK
dell.z@ieee.org

Robert Mao
Microsoft Corp.
EPDC5/2352, South County Business Park
Leopardstown, Dublin, Ireland
robmao@microsoft.com

Abstract

Standard text classification techniques assume that all documents are independent and identically distributed (i.i.d.). However, hypertext documents such as web pages are interconnected with links. How to exploit such links as extra evidence to enhance automatic classification of hypertext documents is a non-trivial problem. We argue that a collection of interconnected hypertext documents can be considered as a complex network, and that the underlying community structure of such a document network contains valuable clues about the right classification of documents. This paper introduces a new technique, Modularity Eigenmap, that can effectively extract community structure features from the document network, which is either induced from document link information only or constructed by combining both document content and document link information. A number of experiments on real-world benchmark datasets show that the proposed approach leads to excellent classification performance in comparison with the state-of-the-art methods.

1 Introduction

Automatic text classification or categorization is a fundamental problem in information retrieval and organization [9]. Standard text classification techniques assume that all documents are independent and identically distributed (i.i.d.). However, hypertext documents such as web pages are interconnected with links. Although standard text classification techniques can easily be applied to the problem of hypertext classification, we anticipate that better classification performance could be achieved by exploiting the additional document link information.

Using the raw links directly as features is usually not promising.
Because the document network is sparse, two documents in the same class may appear dissimilar on the surface when they are not directly connected, yet such a pair is likely to be connected indirectly by paths through a number of intermediate documents. In other words, it is important to exploit not only local link information but also global structure information for effective hypertext classification.

We argue that a collection of interconnected hypertext documents can be considered as a complex network [7], and that the underlying community structure [8] of such a document network contains valuable clues about the right classification of documents. A number of graph-based classification methods have emerged in recent years (see Section 2). However, they are not really suitable for complex networks because they do not take the degree distribution of the network into consideration. This paper introduces a new technique, Modularity Eigenmap, that can effectively extract community structure features from the document network for enhanced hypertext classification.

2 Related Work

Some hypertext classification methods rely solely on document link information. Previous studies have shown that using raw links directly as features does not work well in practice [2, 12]. Inspired by the success of PageRank and HITS in Web search and mining, researchers have tried to apply link analysis techniques to hypertext classification. Gyöngyi et al. propose a technique similar to Personalised PageRank for large-scale web page classification [5]. Zhou et al. propose a technique in the semi-supervised learning framework, Directed Graph Regularization (DGR), that combines the ‘hub’ and ‘authority’ information of web pages for their classification [11].

Some hypertext classification methods make use of both document content and document link information.
The simplest method is to expand each document’s feature vector by incorporating the features of its neighbours (directly linked documents), but this method seems to have difficulties with parameter tuning and often does not provide a robust solution [2]. Chakrabarti et al. propose to address the hypertext