Predictive Overlapping Co-Clustering

Chandrima Sarkar
University of Minnesota Twin Cities
Minneapolis, Minnesota
sarkar@cs.umn.edu

Jaideep Srivastava
University of Minnesota Twin Cities
Minneapolis, Minnesota
srivasta@umn.edu

ABSTRACT
In the past few years, co-clustering has emerged as an important data mining tool for two-way data analysis. Co-clustering has several advantages over traditional one-dimensional clustering, such as the ability to find highly correlated sub-groups of rows and columns. However, one overlooked benefit of co-clustering is that it can be used to extract meaningful knowledge for other knowledge extraction purposes. For example, building predictive models with high-dimensional data and a heterogeneous population is a non-trivial task. Co-clusters extracted from such data, which show similar patterns in both dimensions, can be used to build more accurate predictive models. Several applications, such as finding patient-disease cohorts in health care analysis, finding user-genre groups in recommendation systems, and community detection, can benefit from a co-clustering technique that uses the predictive power of the data to generate co-clusters for improved data analysis.

In this paper, we present the novel idea of Predictive Overlapping Co-Clustering (POCC) as an optimization problem for more effective predictive analysis. Our algorithm generates optimal co-clusters by maximizing the predictive power of the co-clusters subject to constraints on the number of row and column clusters. Precision, recall, and F-measure are used as evaluation measures of the resulting co-clusters. The results of our algorithm are compared with two other well-known techniques, K-means and spectral co-clustering, on four real data sets: Leukemia, Internet-Ads, Ovarian cancer, and MovieLens.
The results demonstrate the effectiveness and utility of our algorithm POCC in practice.

Keywords
Co-clustering, Predictive power, Simulated annealing

1. INTRODUCTION
Real-life data can in general be considered dyadic in nature, i.e., the data can be represented as a two-dimensional matrix whose rows and columns are two separate entities of interest. Common examples include co-occurrence matrices, rating matrices, and proximity matrices. An important problem in dyadic data analysis is finding block structures hidden in the data matrix. Finding hidden blocks can be beneficial in several applications. For example, we may be interested in finding groups of patients that show similar activity patterns under a specific subset of health care conditions [29], simultaneously clustering movies and user ratings in collaborative filtering [13], finding document and word clusters in text clustering [12], or grouping genes with similar properties based on their expression patterns under various conditions or across different tissue samples in bioinformatics [6, 8]. Co-clustering is an important and efficient solution for this purpose: it exploits the duality between data points and features by grouping each based on its distribution over the other [11, 16].

Most co-clustering algorithms focus on finding co-clusters in which each data point in the data matrix has a single membership [12, 2]. Although these techniques perform well on real data sets, they are based on the assumption that a single data point can belong to only one cluster. This assumption is often not completely valid, since in real life there is a high probability that a single data point belongs to multiple clusters, with varying degrees of membership. For example, in a recommendation system, a group of users may prefer pop music as well as country music.
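The difference between single and overlapping membership can be made concrete with a small sketch. The user names, genre labels, and the helper functions `memberships` and `is_partition` below are hypothetical and chosen only for illustration; they are not part of POCC, and degrees of membership are omitted for brevity.

```python
# Toy illustration of single-membership vs. overlapping co-clusters.
# All names and data here are hypothetical and do not come from the paper.
from typing import Dict, List, Set, Tuple

# A co-cluster is a pair: (set of row members, set of column members).
CoCluster = Tuple[Set[str], Set[str]]

# Single membership: every user appears in exactly one co-cluster.
hard: Dict[str, CoCluster] = {
    "pop": ({"ann", "bob"}, {"pop"}),
    "country": ({"cat", "dan"}, {"country"}),
}

# Overlapping membership: user "cat" likes both pop and country music.
overlapping: Dict[str, CoCluster] = {
    "pop": ({"ann", "bob", "cat"}, {"pop"}),
    "country": ({"cat", "dan"}, {"country"}),
}


def memberships(coclusters: Dict[str, CoCluster]) -> Dict[str, List[str]]:
    """Map each row to the sorted list of co-clusters that contain it."""
    result: Dict[str, List[str]] = {}
    for name, (rows, _cols) in coclusters.items():
        for row in rows:
            result.setdefault(row, []).append(name)
    return {row: sorted(names) for row, names in result.items()}


def is_partition(coclusters: Dict[str, CoCluster]) -> bool:
    """Return True when every row belongs to exactly one co-cluster."""
    return all(len(names) == 1 for names in memberships(coclusters).values())
```

Under these assumptions, `is_partition(hard)` holds while `is_partition(overlapping)` does not, because `memberships(overlapping)["cat"]` lists both co-clusters. A single-membership algorithm would be forced to drop one of the two preferences for that user.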
In fact, several real-life situations that deal with high-dimensional data and heterogeneous populations can benefit more from finding co-clusters that overlap each other. One important example is finding co-clusters in Electronic Health Records (EHR), i.e., hospital data, for predictive analysis in health care. EHR data is often high-dimensional with a heterogeneous population, which makes co-clustering a suitable approach for finding groups of patients and disease conditions. However, each of these patient-disease co-clusters should reflect patient sub-populations that potentially share co-morbid diagnoses, as shown in Figure 1. Hence, in this scenario, detecting overlapping co-clusters would help capture the most useful patterns that exist in the data.

arXiv:1403.1942v1 [cs.LG] 8 Mar 2014

Previous research has developed different approaches to generating co-clusters, such as bipartite graph [11] and model-based [3] co-clustering techniques. However, develop-