Combining Content and Link for Classification using Matrix Factorization

Shenghuo Zhu, Kai Yu, Yun Chi, Yihong Gong
{zsh,kyu,ychi,ygong}@sv.nec-labs.com
NEC Laboratories America, Inc., 10080 North Wolfe Road SW3-350, Cupertino, CA 95014, USA

ABSTRACT
The World Wide Web contains rich textual content that is interconnected via complex hyperlinks. This huge database violates the assumption, held by most conventional statistical methods, that each web page is an independent and identically distributed sample. It is thus difficult to apply traditional mining or learning methods to web mining problems, e.g., web page classification, in a way that exploits both the content and the link structure. Research in this direction has recently received considerable attention but is still at an early stage. Although a few methods exploit both the link structure and the content information, some of them combine only the authority information with the content information, while others first decompose the link structure into hub and authority features and then apply them as additional document features. This paper designs an algorithm, practically attractive for its great simplicity, that exploits both the content and linkage information by carrying out a joint factorization of the linkage adjacency matrix and the document-term matrix, deriving a new representation of web pages in a low-dimensional factor space without explicitly separating them into content, hub, or authority factors. Further analysis can then be performed on this compact representation of the web pages. In the experiments, the proposed method is compared with state-of-the-art methods and demonstrates excellent accuracy in hypertext classification on the WebKB and Cora benchmarks.
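As a rough illustration of the idea stated in the abstract — a joint low-rank factorization that ties the linkage adjacency matrix A and the document-term matrix C together through a shared page-factor matrix Z — the following NumPy sketch minimizes ||A − ZMZᵀ||² + α||C − ZU||² plus a ridge penalty by plain gradient descent. The objective form, parameter names, and hyperparameters here are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def joint_factorize(A, C, k=2, alpha=1.0, gamma=0.1, lr=1e-3, iters=2000, seed=0):
    """Sketch of a joint factorization: A ~ Z M Z^T (links), C ~ Z U (content).

    A: (n, n) adjacency matrix; C: (n, v) document-term matrix;
    Z: (n, k) shared page factors; M: (k, k) link-factor interactions;
    U: (k, v) term loadings. All names/defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, v = C.shape
    Z = rng.standard_normal((n, k)) * 0.1
    M = rng.standard_normal((k, k)) * 0.1
    U = rng.standard_normal((k, v)) * 0.1
    for _ in range(iters):
        R = Z @ M @ Z.T - A          # link reconstruction residual
        S = Z @ U - C                # content reconstruction residual
        # Gradients of ||R||_F^2 + alpha*||S||_F^2 + gamma*(||Z||^2+||M||^2+||U||^2)
        gZ = 2 * (R @ Z @ M.T + R.T @ Z @ M) + 2 * alpha * (S @ U.T) + 2 * gamma * Z
        gM = 2 * (Z.T @ R @ Z) + 2 * gamma * M
        gU = 2 * alpha * (Z.T @ S) + 2 * gamma * U
        Z -= lr * gZ
        M -= lr * gM
        U -= lr * gU
    return Z, M, U
```

After fitting, each row of Z is a low-dimensional representation of one page that reflects both its terms and its link neighborhood, and can be fed to any off-the-shelf classifier.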
Categories and Subject Descriptors: H.3.3 [Information Systems]: Information Search and Retrieval
General Terms: Algorithms, Experimentation
Keywords: Link structure, Text content, Factor analysis, Matrix factorization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR’07, July 23–27, 2007, Amsterdam, The Netherlands.
Copyright 2007 ACM 978-1-59593-597-7/07/0007 ...$5.00.

1. INTRODUCTION
With the advance of the World Wide Web, more and more hypertext documents have become available on the Web. Examples of such data include organizational and personal web pages (e.g., the WebKB benchmark data set, which contains university web pages), research papers (e.g., data in CiteSeer), online news articles, and customer-generated media (e.g., blogs). Compared to data in traditional information management, these Web data contain, in addition to content, links: e.g., hyperlinks from a student's homepage pointing to the homepage of her advisor, paper citations, sources of a news article, comments of one blogger on posts from another blogger, and so on. Performing information management tasks on such structured data raises many new research challenges. In the following discussion, we use the task of web page classification as an illustrative example, although the techniques we develop in later sections apply equally well to many other tasks in information retrieval and data mining.

For the problem of classifying web pages, a simple approach is to treat web pages as independent documents.
The advantage of this approach is that many off-the-shelf classification tools can be applied directly to the problem. However, this approach relies only on the content of web pages and ignores the structure of the links among them. Link structures provide invaluable information about properties of the documents as well as relationships among them. For example, in the WebKB dataset, the link structure provides additional insight into the relationships among documents (e.g., links often point from a student to her advisor or from a faculty member to his projects). Since links among these documents imply inter-dependence among the documents, the usual i.i.d. (independent and identically distributed) assumption no longer holds. From this point of view, traditional classification methods that ignore the link structure may not be suitable. On the other hand, a few studies, for example [25], rely solely on link structures. It is, however, very rare that content information can be ignored. For example, in the Cora dataset, the content of a research article's abstract largely determines the category of the article. To improve the performance of web page classification, therefore, both the link structure and the content information should be taken into consideration. To achieve this goal, a simple approach is to convert one type of information into the other. For example, in spam blog classification, Kolari et al. [13] concatenate outlink features with the content features of the blog. In document classification, Kurland and Lee [14] convert content similarity among documents into weights of links. However, link and content information have different properties. For example, a link is an actual piece of evidence that represents an asymmetric relationship, whereas content similarity is usually defined conceptually for every pair of documents in a symmetric way.
Therefore, directly converting one type of information into the other usually degrades the quality of the information. On the other hand, there exist some studies, as we will discuss in detail in the related work, that consider link information and content information separately and then combine them. We argue that such an approach ignores the inherent consistency between link and con-