IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 10, OCTOBER 2012

Sparse Unsupervised Dimensionality Reduction for Multiple View Data

Yahong Han, Fei Wu, Dacheng Tao, Member, IEEE, Jian Shao, Yueting Zhuang, Member, IEEE, and Jianmin Jiang

Abstract—Different kinds of high-dimensional visual features can be extracted from a single image. Images can thus be treated as multiple view data when each type of extracted high-dimensional visual feature is taken as a particular understanding of the images. In this paper, we propose a framework of sparse unsupervised dimensionality reduction for multiple view data. The goal of our framework is to find a low-dimensional optimal consensus representation from multiple heterogeneous features by multiview learning. In this framework, we first learn low-dimensional patterns individually from each view, respecting the specific statistical property of each view. We then construct a low-dimensional optimal consensus representation from those learned patterns, the goal of which is to leverage the complementary nature of the multiple views. We formulate this construction as approximating the matrix of patterns by the product of a low-dimensional consensus base matrix and a loading matrix. To select the most discriminative features for the spectral embedding of multiple views, we propose to add an ℓ1-norm penalty on the loading matrix's columns and to impose orthogonality constraints on the base matrix. We develop a new alternating algorithm, spectral sparse multiview embedding, to obtain the solution efficiently. Each row of the loading matrix encodes structured information corresponding to multiple patterns. In order to gain flexibility in sharing information across subsets of the views, we impose a novel structured sparsity-inducing norm penalty on the loading matrix's rows.
This penalty makes the loading coefficients adaptively load shared information across subsets of the learned patterns. We call this method structured sparse multiview dimensionality reduction. Experiments on a toy benchmark image data set and two real-world Web image data sets demonstrate the effectiveness of the proposed algorithms.

Index Terms—Multiple view data, structured sparsity, video and image classification.

Manuscript received August 7, 2011; revised November 1, 2011 and January 10, 2012; accepted February 19, 2012. Date of publication June 1, 2012; date of current version September 28, 2012. This work was supported in part by the National Basic Research Program of China under Grant 2012CB316400, in part by the Natural Science Foundation of China under Grants 60833006 and 61070068, and in part by the Fundamental Research Funds for the Central Universities. The work of Y. Han was supported in part by the Scholarship Award for Excellent Doctoral Students granted by the Ministry of Education, China. This paper was recommended by Associate Editor R. Rinaldo.

Y. Han is with the School of Computer Science and Technology, Tianjin University, Tianjin 300072, China (e-mail: hanyahong@gmail.com).

F. Wu, J. Shao, and Y. Zhuang are with the College of Computer Science, Zhejiang University, Zhejiang 310058, China (e-mail: wufei@zju.edu.cn; jshao@zju.edu.cn; yzhuang@zju.edu.cn).

D. Tao is with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, NSW 2007, Australia (e-mail: dacheng.tao@uts.edu.au).

J. Jiang is with the School of Computer Science and Technology, Tianjin University, Tianjin 300072, China, and also with the University of Surrey, Surrey GU2 7XH, U.K. (e-mail: jmjiang@tju.edu.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2012.2202075

I. Introduction

IN THE CASE of real-world images, we can extract different kinds of high-dimensional visual features from any given image. For example, we can simultaneously extract color, texture, and shape features from one image, so each image sample can be represented by several types of visual features. In this paper, we take each representation of one type of visual feature as a view of the image data; thus, the extracted multiple kinds of high-dimensional visual features are taken as multiple views of one image. In the high-dimensional space of each view's representation, it is hard to discriminate images of different classes, or images annotated with different labels, from one another, which leads to the so-called "curse of dimensionality" problem [1]. Over the past few years, a large family of dimensionality reduction and manifold learning algorithms [2]–[5] has been proposed to find an appropriate low-dimensional subspace or manifold of the high-dimensional image data, the goal being to better characterize the images and discriminate among them in such low-dimensional representations. Furthermore, since different views (visual features) have their own specific statistical properties, different visual features may have different discriminative power for the task of image classification or image annotation. In computer vision and multimedia research, many works [6]–[9] have shown that leveraging the information contained in multiple views potentially has a discriminative advantage over using only a single view. This paper targets learning a low-dimensional optimal consensus representation from multiple views of image data, so as to obtain better image classification and annotation performance in the learned low-dimensional representations.

One traditional solution for multiple view data is to concatenate the vectors of different views into a new vector and then apply machine learning algorithms directly to the concatenated vector.
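The concatenation baseline can be made concrete with a minimal sketch. The view names, dimensionalities, and random data below are illustrative assumptions, not features from the paper's experiments:

```python
import numpy as np

# Hypothetical per-view feature matrices for the same n = 4 images
# (names and dimensions are illustrative assumptions).
n = 4
rng = np.random.RandomState(0)
color = rng.rand(n, 64)    # e.g., a color-histogram view
texture = rng.rand(n, 32)  # e.g., a texture-descriptor view
shape = rng.rand(n, 16)    # e.g., a shape-descriptor view

# Concatenation baseline: stack all views into one long vector per image,
# then hand the result to any single-view learning algorithm.
concatenated = np.hstack([color, texture, shape])
print(concatenated.shape)  # (4, 112)
```

Each row of `concatenated` is a single 112-dimensional vector per image; the per-view boundaries and statistics are invisible to any algorithm applied afterward, which is exactly the weakness discussed next.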
However, this concatenation ignores the complementary nature and the specific statistical properties of the different views. Unsupervised canonical correlation analysis [10] makes use of two views of the same underlying semantic object to extract a common representation. In order to perform multiview learning beyond the limit of two views, much effort has been focused on multiview clustering [11]–[13], multiview classification [14], [15], multiview semisupervised (transductive) learning [13], [16], and even multiview dimensionality reduction [12], [17]. In dimensionality reduction of multiple view data, two key issues should be considered. First, since different views have their own specific statistical properties,

1051-8215/$31.00 © 2012 IEEE