A Probabilistic Framework for Relational Clustering

Bo Long, Computer Science Dept., SUNY Binghamton, Binghamton, NY 13902, blong1@binghamton.edu
Zhongfei (Mark) Zhang, Computer Science Dept., SUNY Binghamton, Binghamton, NY 13902, zzhang@binghamton.edu
Philip S. Yu, IBM Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, psyu@us.ibm.com

ABSTRACT

Relational clustering has attracted more and more attention due to its phenomenal impact in various important applications that involve multi-type interrelated data objects, such as Web mining, search marketing, bioinformatics, citation analysis, and epidemiology. In this paper, we propose a probabilistic model for relational clustering, which also provides a principled framework to unify various important clustering tasks, including traditional attribute-based clustering, semi-supervised clustering, co-clustering, and graph clustering. The proposed model seeks to identify cluster structures for each type of data objects and interaction patterns between different types of objects. Under this model, we propose parametric hard and soft relational clustering algorithms under a large number of exponential family distributions. The algorithms are applicable to relational data of various structures and at the same time unify a number of state-of-the-art clustering algorithms: co-clustering algorithms, the k-partite graph clustering, and semi-supervised clustering based on hidden Markov random fields.

Categories and Subject Descriptors: E.4 [Coding and Information Theory]: Data compaction and compression; H.3.3 [Information Search and Retrieval]: Clustering; I.5.3 [Pattern Recognition]: Clustering.

General Terms: Algorithms.

Keywords: Clustering, Relational data, Relational clustering, Semi-supervised clustering, EM algorithm, Bregman divergences, Exponential families.

1.
INTRODUCTION

Most clustering approaches in the literature focus on "flat" data in which each data object is represented as a fixed-length attribute vector [38]. However, many real-world data sets are much richer in structure, involving objects of multiple types that are related to each other, such as documents and words in a text corpus; Web pages, search queries, and Web users in a Web search system; and shops, customers, suppliers, shareholders, and advertisement media in a marketing system.

KDD'07, August 12–15, 2007, San Jose, California, USA. Copyright 2007 ACM 978-1-59593-609-7/07/0008 ...$5.00.

In general, relational data contain three types of information: attributes for individual objects, homogeneous relations between objects of the same type, and heterogeneous relations between objects of different types. For example, in a scientific publication relational data set of papers and authors, personal information such as affiliation for authors constitutes attributes; the citation relations among papers are homogeneous relations; and the authorship relations between papers and authors are heterogeneous relations. Such data violate the classic IID assumption in machine learning and statistics and present huge challenges to traditional clustering approaches. An intuitive solution is to transform relational data into flat data and then cluster each type of objects independently. However, this may not work well for the following reasons. First, the transformation causes the loss of relation and structure information [14].
Second, traditional clustering approaches are unable to handle influence propagation in clustering relational data, i.e., the hidden patterns of different types of objects could affect each other both directly and indirectly (passed along relation chains). Third, in some data mining applications, users are interested not only in the hidden structure for each type of objects, but also in the interaction patterns involving multiple types of objects. For example, in document clustering, in addition to document clusters and word clusters, the relationship between document clusters and word clusters is also useful information. It is difficult to discover such interaction patterns by clustering each type of objects individually.

Moreover, a number of important clustering problems, which have been of intensive interest in the literature, can be viewed as special cases of relational clustering. For example, graph clustering (partitioning) [7, 42, 13, 6, 20, 28] can be viewed as clustering on single-type relational data consisting of only homogeneous relations (represented as a graph affinity matrix); co-clustering [12, 2], which arises in important applications such as document clustering and micro-array data clustering, can be formulated as clustering on bi-type relational data consisting of only heterogeneous relations. Recently, semi-supervised clustering [46, 4], a special type of clustering that uses both labeled and unlabeled data, has attracted significant attention. In Section 5, we show that semi-supervised clustering can be formulated as clustering on single-type relational data consisting of attributes and homogeneous relations.

Therefore, relational data present not only huge challenges to traditional unsupervised clustering approaches, but also a great need for theoretical unification of various clustering tasks.
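To make these representations concrete, the following is a minimal sketch, with purely hypothetical toy data (not taken from the paper), of the three types of information in the paper-author example, and of which parts each special case uses.

```python
# (1) Attributes for individual objects: one feature vector per author,
# e.g., a hypothetical numeric encoding of affiliation.
author_attrs = [
    [1.0, 0.0],   # author 0
    [0.9, 0.1],   # author 1
    [0.0, 1.0],   # author 2
]

# (2) Homogeneous relations: the paper-paper citation graph as a
# symmetric 4 x 4 affinity matrix. Graph clustering (partitioning)
# operates on such a matrix alone.
citations = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

# (3) Heterogeneous relations: the 4 x 3 paper-author authorship
# matrix relating two object types. Co-clustering on bi-type data
# operates on such a matrix alone.
authorship = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
]

# Semi-supervised clustering can be seen as adding a homogeneous
# "constraint" relation on a single object type: +1 for a must-link
# pair and -1 for a cannot-link pair, alongside the attributes.
constraints = [
    [ 0, 1, 0, -1],
    [ 1, 0, 0,  0],
    [ 0, 0, 0,  1],
    [-1, 0, 1,  0],
]
```

A full relational clustering model would consume all three kinds of matrices jointly, whereas each special case above simply drops the parts it does not need.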
In this paper, we propose a probabilistic model for relational clustering, which also provides a principled framework to unify various important clustering tasks, includ-