EXPLOITING MULTIVIEW PROPERTIES IN SEMI-SUPERVISED VIDEO CLASSIFICATION

Mahmood Karimian, Mostafa Tavassolipour, Shohreh Kasaei
Sharif University of Technology

ABSTRACT

In large databases, obtaining enough labeled training data for classification is mostly prohibitive. Semi-supervised algorithms are employed to tackle this lack of labeled training data. Video databases epitomize this scenario, which is why semi-supervised learning has found a niche in video classification. Graph-based methods are a promising platform for semi-supervised video classification. Owing to the multiview nature of video data, different features have been proposed (such as SIFT, STIP, and MFCC), each of which can be used to build a graph. In this paper, we propose a new classification method that fuses the results of manifold regularization over different graphs. Like co-training, our method is iterative and tries to find the labels of unlabeled data during each iteration; unlike co-training, it takes the unlabeled data into account in the classification procedure. The fusion is performed after manifold regularization using a ranking method, which makes the algorithm competitive with supervised methods. Our experimental results on the CCV database show the efficiency of the proposed method.

Index Terms: Semi-supervised learning, manifold regularization, co-training, video classification, multiview features.

1. INTRODUCTION

Video is inundating databases, most significantly web databases. For instance, YouTube is the second largest search engine, and the number of videos uploaded to it is increasing dramatically day by day [1], [2]. Therefore, methods by which users can retrieve their desired videos quickly and accurately are essential. On the other hand, labeling a sufficient number of videos manually is impractical, considering the huge volume of available video data.
Thus, semi-supervised learning methods, which try to find proper labels for data when insufficient labeled data are available, must be employed to tackle this problem [3].

A special characteristic of video data is their multiview property; i.e., features can be extracted from their visual, audio, motion, or textual properties [2], [4]. Co-training, which exploits the multiview property of video, is one of the leading semi-supervised learning methods. The underlying assumptions in co-training methods are that the views are independent given the class and that each view is able to classify the data to some extent [5]. Co-training is used in [6] for video concept detection. In that method, the concepts of videos are detected using two views, each of which has a limited ability to detect the concepts. The views are combined so that they complement each other, and the combination increases labeling accuracy. So far, the classification engines used in co-training methods have been supervised classifiers: SVMs, neural networks, and naive Bayes. In [7], a multiview regularization method is proposed to train the classifier. In general, the trend in semi-supervised multiview algorithms is to apply an iterative classification that uses only labeled data and assigns labels to unlabeled data in each step. The iteration continues until all data are labeled [8], [9], [10].

Graph-based algorithms are another promising family of semi-supervised learning methods. Two fundamental assumptions in graph-based methods are [11]: (1) samples with high similarity tend to have the same labels, and (2) the estimated labels of the initially labeled samples should be as close as possible to their true labels. Numerous graph-based semi-supervised learning methods are available, such as manifold regularization [11] and local and global consistency [12].
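To make the two graph-based assumptions concrete, the following minimal sketch fits label scores on a toy similarity graph by minimizing a label-fitting term on the labeled samples plus a graph-smoothness term. It is an illustrative simplification and not the manifold regularization of [11], which learns a function in a kernel space. The function names (`gaussian_affinity`, `graph_regularized_scores`), the parameters `sigma` and `gamma`, and the toy data are assumptions introduced here for illustration only.

```python
import numpy as np


def gaussian_affinity(X, sigma=1.0):
    """Dense similarity graph: W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W


def graph_regularized_scores(W, y, labeled_mask, gamma=0.1):
    """Minimize sum_{i labeled} (f_i - y_i)^2 + gamma * f^T L f.

    The first term keeps estimates close to the known labels; the second
    term (L is the unnormalized graph Laplacian) pushes similar samples
    toward the same score. Setting the gradient to zero gives the linear
    system (J + gamma * L) f = J y, where J is a diagonal 0/1 matrix that
    marks the labeled samples.
    """
    L = np.diag(W.sum(axis=1)) - W
    J = np.diag(labeled_mask.astype(float))
    return np.linalg.solve(J + gamma * L, J @ y)


# Toy data (assumed): two well-separated 2-D clusters, one labeled sample each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(3.0, 0.3, size=(10, 2))])
y = np.zeros(20)
labeled_mask = np.zeros(20, dtype=bool)
y[0], labeled_mask[0] = -1.0, True
y[10], labeled_mask[10] = 1.0, True

f = graph_regularized_scores(gaussian_affinity(X), y, labeled_mask)
predicted = np.sign(f)  # class decision for every sample, labeled or not
```

In this toy setting a single labeled sample per cluster suffices, because the smoothness term spreads each label along the strong within-cluster edges. In a multiview setting, one such graph could be built per feature type.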
In this paper, we introduce a method in which the multiview property of video data is combined with graph-based semi-supervised algorithms. The underlying assumption of our algorithm is that the data lie on a manifold in each view. This is a reasonable assumption, as discussed in [3]. Furthermore, because our algorithm exploits the multiview property, it remains robust even if the data in some of the views are not structured as a manifold. The merits of each view are exploited in our method through a ranking-based fusion method. First, a graph is formed from each of the existing views; then, a ranking-based decision criterion over the manifold regularization output of each graph is used to label the samples ranked highest by our ranking metric. The algorithm