Multiview Learning with Labels

Tom Diethe, David R. Hardoon, John Shawe-Taylor
University College London
{t.diethe,d.hardoon,jst}@cs.ucl.ac.uk

Abstract

CCA can be seen as a multiview extension of PCA, in which information from two sources is used for learning by finding a subspace in which the two views are most correlated. However PCA, and by extension CCA, does not use label information. Fisher Linear Discriminant Analysis uses label information to find informative projections. We show that LDA and its dual can both be formulated as generalized eigenproblems, enabling a kernel formulation. We derive a regularised two-view equivalent of Fisher Linear Discriminant (LDA-2) and its corresponding dual (LDA-2K), both of which can also be formulated as generalized eigenproblems.

1 Introduction

The motivation for this paper comes from the desire to combine multiple sources of information in a learning framework where labels are known. Canonical correlation analysis (CCA), introduced by Hotelling in 1936 [1], is a method of finding linear relationships between two sets of multidimensional variables such that the correlation between them is maximised. CCA makes use of two views of the same underlying semantic object to extract a common representation of the semantics. Kernel CCA (KCCA), which is a generalised form of kernel independent components analysis [2], is a nonlinear version of CCA which allows nonlinear relations to be found between multivariate variables effectively [3]. However, both CCA and KCCA are effectively unsupervised techniques, and as such are not ideally suited to a classification setting. A common way of performing classification on two-view data using KCCA is to use the projected data from one of the views as input to a standard classification algorithm, such as a Support Vector Machine (SVM). However, the subspace that is learnt through such unsupervised methods may not always align well with the label space.
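To make the correlation criterion concrete, the first canonical direction can be recovered from a generalized eigenproblem on the view covariance matrices. The following is a minimal sketch, not the paper's implementation: it assumes two synthetic views of a shared latent signal and a small ridge term for numerical stability, and solves C_xy C_yy^{-1} C_yx w = ρ² C_xx w for the leading pair.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 500
# Shared latent signal observed through two noisy "views"
z = rng.standard_normal(n)
X = np.column_stack([z + 0.5 * rng.standard_normal(n),
                     rng.standard_normal(n)])
Y = np.column_stack([rng.standard_normal(n),
                     z + 0.5 * rng.standard_normal(n)])
X -= X.mean(0)
Y -= Y.mean(0)

Cxx = X.T @ X / n + 1e-6 * np.eye(2)   # small ridge for stability (assumption)
Cyy = Y.T @ Y / n + 1e-6 * np.eye(2)
Cxy = X.T @ Y / n

# First canonical direction for X: generalized eigenproblem
#   Cxy Cyy^{-1} Cyx w = rho^2 Cxx w
M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
vals, vecs = eigh(M, Cxx)              # eigenvalues in ascending order
wx = vecs[:, -1]                       # direction with largest rho^2
rho = np.sqrt(vals[-1])                # leading canonical correlation
```

With this noise level the two views share roughly 80% of their variance through z, so the leading canonical correlation comes out close to 0.8; a kernelised variant (KCCA) replaces the covariances with kernel matrices.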
1.1 Subspace Learning

In standard single-view subspace learning, a parallel can be drawn between subspace projections that are independent of the label space, such as Principal Components Analysis (PCA), and those that incorporate label information, such as Fisher Linear Discriminant Analysis (Fisher LDA). PCA searches for directions in the data that have largest variance and projects the data onto a subset of these directions. In this way, we obtain a lower dimensional representation of the data that captures most of the variance. PCA is an unsupervised technique and as such does not include label information of the data. For instance, if we are given 2-dimensional data from two classes forming two long and thin clusters, such that the clusters are positioned in parallel and very closely together, the total variance ignoring the labels would be in the lengthways direction of the clusters. For classification, this would be a poor projection, because the labels would be evenly mixed. A much more useful projection would be orthogonal to the clusters, i.e. in the direction of least overall variance, which would perfectly separate the two classes. We would then perform classification in this 1-dimensional space. Fisher LDA would find exactly this projection.