Using objective ground-truth labels created by multiple annotators for improved video classification: A comparative study

Gaurav Srivastava ⁎, Josiah A. Yoder, Johnny Park, Avinash C. Kak
School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, USA

Article history: Received 16 December 2011; Accepted 17 June 2013; Available online 13 July 2013

Keywords: Video classification; Large dataset analysis; Data annotation

Abstract

We address the problem of predicting category labels for unlabeled videos in a large video dataset by using a ground-truth set of objectively labeled videos that we have created. Large video databases such as YouTube require that a user uploading a new video assign to it a category label from a prescribed set of labels. Such category labeling is likely to be corrupted by the subjective biases of the uploader. Despite their noisy nature, these subjective labels are frequently used as the gold standard in algorithms for multimedia classification and retrieval. Our goal in this paper is NOT to propose yet another algorithm that predicts labels for unseen videos based on the subjective ground truth. Rather, our goal is to demonstrate that video classification performance can be improved if, instead of using subjective labels, we first create an objectively labeled ground-truth set of videos and then train a classifier on that ground truth to predict objective labels for the unlabeled videos. To generate the objectively labeled ground-truth dataset, we rely on the notion that when a video is labeled by a panel of diverse individuals, the majority opinion rendered by the panel may be taken to be the objective opinion.
In this manner, using judgments provided by multiple human annotators, we have collected objective labels for a ground-truth dataset consisting of 1000 randomly selected videos from the TinyVideos database, which contains roughly 52,000 videos from YouTube (courtesy of Karpenko and Aarabi [1]). Through a fourfold cross-validation experiment on the ground-truth set, we demonstrate that the objective labels have superior consistency compared to the subjective labels when used for video classification. We show that this claim holds for several different kinds of feature sets that one can use to compare videos and for two different types of classifiers that one can use for label prediction. Subsequently, we use the ground-truth dataset of 1000 videos to predict the objective category labels of the remaining 51,000 videos. We compare the objective labels thus determined with the subjective labels provided by the video uploaders and qualitatively argue for the more informative nature of the objective labels.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

Massive amounts of image and video data have been, and continue to be, uploaded to Internet-based visual content databases such as Picasa, Flickr, YouTube, Archive.org, Hulu, and others. The content for these databases is generally created in realistic settings from sports and news coverage; documentaries on travel, science, and technology; social events; and so on. In recent years, this type of data has fast become the experimental data of choice in computer vision research because it encompasses large inter- and intra-class variability and presents very interesting challenges for problems like object detection and tracking, face recognition, human activity analysis, and so on.

In this paper, we address the problem of consistently and objectively labeling the content in these large databases.
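The abstract's aggregation rule — taking the majority opinion of a panel of annotators as the objective label — can be sketched in a few lines. This is an illustrative sketch only, not the authors' code; the function name and tie-handling policy (returning no label when there is no clear majority) are assumptions for the example.

```python
from collections import Counter

def majority_label(annotations):
    """Return the category chosen by most annotators, or None on a tie.

    `annotations` is a list of category strings, one per annotator.
    Videos with no clear majority could be set aside for re-annotation.
    """
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # top two categories are tied: no objective label
    return counts[0][0]

# e.g. a panel of three annotators:
majority_label(["Music", "Music", "Comedy"])  # -> "Music"
```

A video whose annotators split evenly (e.g. `["Music", "Comedy"]`) yields no majority label under this policy.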
Although applicable to all large video databases, our work focuses specifically on the videos that are uploaded to the YouTube database. YouTube requires that every video uploaded to its servers be assigned one of the 15 broad category labels listed in Table 1. Karpenko and Aarabi [1] have noted that there is a significant amount of labeling noise in the categories assigned to YouTube videos. The uploader may assign a category label based on his/her subjective judgment about the theme of the video content. Also, the label assigned to a video may be motivated by a particular section of the video, while the rest of the video may be totally unrelated to the label. It may also be influenced by the motivation that led to the creation of the video in the first place, or by the uploader's opinion about

⁎ Corresponding author. E-mail addresses: email.gaurav.srivastava@gmail.com (G. Srivastava), yoderj@gmail.com (J.A. Yoder), jpark@purdue.edu (J. Park), kak@purdue.edu (A.C. Kak).
Computer Vision and Image Understanding 117 (2013) 1384–1399
http://dx.doi.org/10.1016/j.cviu.2013.06.009