Feature Enriched Nonparametric Bayesian Co-clustering

Pu Wang, Carlotta Domeniconi, Huzefa Rangwala, and Kathryn B. Laskey

George Mason University
4400 University Ave., Fairfax, VA 22030 USA
{pwang7,carlotta,rangwala}@cs.gmu.edu klaskey@gmu.edu

Abstract. Co-clustering has emerged as an important technique for mining relational data, especially when data are sparse and high-dimensional. Co-clustering simultaneously groups the different kinds of objects involved in a relation. Most co-clustering techniques typically only leverage the entries of the given contingency matrix to perform the two-way clustering. As a consequence, they cannot predict the interaction values for new objects. In many applications, though, additional features associated to the objects of interest are available. The Infinite Hidden Relational Model (IHRM) has been proposed to make use of these features. As such, IHRM has the capability to forecast relationships among previously unseen data. The work on IHRM lacks an evaluation of the improvement that can be achieved when leveraging features to make predictions for unseen objects. In this work, we fill this gap and re-interpret IHRM from a co-clustering point of view. We focus on the empirical evaluation of forecasting relationships between previously unseen objects by leveraging object features. The empirical evaluation demonstrates the effectiveness of the feature-enriched approach and identifies the conditions under which the use of features is most useful, i.e., with sparse data.

Keywords: Bayesian Nonparametrics; Dirichlet Processes; Co-clustering; Protein-molecule interaction data

1 Introduction

Co-clustering [11] has emerged as an important approach for mining relational data. Often, data can be organized in a matrix, where rows and columns present a symmetrical relation.
Co-clustering simultaneously groups the different kinds of objects involved in a relation; for example, proteins and molecules indexing a contingency matrix that holds information about their interaction. Molecules are grouped based on their binding patterns to proteins; similarly, proteins are clustered based on the molecules they interact with. The two clustering processes are inter-dependent. Understanding these interactions provides insight into the underlying biological processes and is useful for designing therapeutic drugs.

Existing co-clustering techniques typically leverage only the entries of the given contingency matrix to perform the two-way clustering. As a consequence, they cannot predict the interaction values for new objects. This greatly limits the applicability of current co-clustering approaches. In many applications, additional features associated with the objects of interest are available, e.g., sequence information for proteins. Such features can be