Learning Mixtures of Multi-Output Regression Models by Correlation Clustering for Multi-View Data

Eric Lei
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213

Kyle Miller
Auton Lab
Carnegie Mellon University
Pittsburgh, PA 15213

Artur Dubrawski
Auton Lab
Carnegie Mellon University
Pittsburgh, PA 15213

Abstract

In many datasets, different parts of the data may have their own patterns of correlation, a structure that can be modeled as a mixture of local linear correlation models. The task of finding these mixtures is known as correlation clustering. In this work, we propose a linear correlation clustering method for datasets whose features are pre-divided into two views. The method, called Canonical Least Squares (CLS) clustering, is inspired by multi-output regression and Canonical Correlation Analysis. CLS clusters can be interpreted as variations in the regression relationship between the two views. The method is useful for data mining and data interpretation. Its utility is demonstrated on a synthetic dataset and stock market dataset.

1 INTRODUCTION

A common problem in data analysis is to investigate correlation structure. In many datasets, different parts of the data may have their own patterns of correlation. In general, clustering data based on local correlations is known as correlation clustering (Klami and Kaski, 2008; Zimek, 2009) (not to be confused with a machine learning graph problem of the same name). Additionally, there may be global nonlinear correlation structure in data. Both issues may be solved by mixing local linear correlation models and identifying them using a clustering method.

In this work, we develop a linear correlation clustering method for datasets whose features are pre-divided into two views. These views can be arbitrary but usually correspond to two distinct facets of the data.
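The setting described above, in which groups of samples each follow their own linear relationship between two views, can be illustrated with a small synthetic sketch. This is not the CLS method itself. It only shows why a single global linear model can miss structure that local models capture. The function names, the coefficient matrices, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np


def make_two_view_mixture(n_per_cluster=500, noise=0.1, seed=0):
    """Generate two-view data from two clusters with opposite linear maps.

    Hypothetical toy data: view X has 2 features, view Y has 2 features.
    Cluster 0 follows Y = X @ B + noise, cluster 1 follows Y = -X @ B + noise.
    """
    rng = np.random.default_rng(seed)
    B = np.array([[1.0, 0.5], [-0.5, 1.0]])  # illustrative coefficients
    X0 = rng.normal(size=(n_per_cluster, 2))
    X1 = rng.normal(size=(n_per_cluster, 2))
    Y0 = X0 @ B + noise * rng.normal(size=(n_per_cluster, 2))
    Y1 = X1 @ (-B) + noise * rng.normal(size=(n_per_cluster, 2))
    X = np.vstack([X0, X1])
    Y = np.vstack([Y0, Y1])
    labels = np.repeat([0, 1], n_per_cluster)
    return X, Y, labels


def r2_least_squares(X, Y):
    """Fit multi-output least squares (no intercept) and return pooled R^2."""
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    ss_res = np.sum((Y - X @ B_hat) ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


X, Y, labels = make_two_view_mixture()
print("global R^2:   ", r2_least_squares(X, Y))
print("cluster 0 R^2:", r2_least_squares(X[labels == 0], Y[labels == 0]))
print("cluster 1 R^2:", r2_least_squares(X[labels == 1], Y[labels == 1]))
```

In this toy example the two local relationships cancel in a pooled fit, so the global fit explains almost none of the variance while each per-cluster fit explains nearly all of it. Here the cluster labels are known by construction; recovering them from the data is the problem that correlation clustering addresses.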
This kind of duality occurs frequently in the real world: important examples include genes and diseases (Seoane et al., 2014), visuals and text (Rasiwasia et al., 2010), and emotions and personality disorders (Sherry and Henson, 2005). If the views are considered input and output, data of this form can be a natural candidate for multi-output regression. We propose a novel technique inspired by multi-output regression called Canonical Least Squares (CLS) and apply it to clustering; CLS clusters can be interpreted as variations in the regression relationship between input and output views. The method is demonstrated on a synthetic dataset and stock market dataset.

Figure 1: Histogram of pre- and post-crisis returns of Alexion Pharmaceuticals, a pharmaceutical company.

arXiv:1709.05602v1 [stat.ML] 17 Sep 2017

We now explore a motivating example involving the stock market. One way to have two views of a time series such as stock returns is to consider temporal windows before and after some event. In our case, we consider the late 2000s financial crisis, which fundamentally altered some facets of the US economy. We hypothesize that the behavior of some stocks changed as a result of the crisis. For instance, Fig. 1 illustrates how the distribution of returns of one company differed before and after the crisis. The distribution became narrower and more symmetric and increased in mean. Granted, there are many factors