Understanding and Modeling User Interests in Consumer Videos

Ryoma Oami
Multimedia Research Laboratories, NEC Corporation
r-oami@az.jp.nec.com

Ana B. Benitez
Dept. of Electrical Engineering, Columbia University
ana@ee.columbia.edu

Shih-Fu Chang
Dept. of Electrical Engineering, Columbia University
sfchang@ee.columbia.edu

Nevenka Dimitrova
Philips Research, 345 Scarborough Road, Briarcliff NY 10510
nevenka.dimitrova@philips.com

Abstract

This paper analyzes the interests of users in viewing and organizing consumer videos. It proposes a taxonomy of relevant concepts with three basic Dimensions Of Interest (DOIs) and effective models to predict the user interest in each dimension. The three DOIs correspond to the objects, the scenes and the events. Our conclusions are backed with an extensive study, in which users were asked to annotate and score the importance of each DOI in short clips of diverse and real consumer videos. Analysis of the user study data reveals high consistency (70%) of the scores across different users, higher importance of objects and events, and independence between objects and events. In addition, we show how heuristic rules and neural networks can accurately predict these scores using camera motion, foreground object and audio information. The automatic and effective prediction of user interests has the potential for improving automatic applications for annotating and summarizing consumer videos, among others.

1 Introduction

In recent years, the increasing popularity of video cameras has stimulated the rapid accumulation of consumer videos. The lack of simple, fast and convenient tools and services to annotate, summarize and manage these consumer video archives, however, has drastically decreased the usability of these videos. Most consumer videos are rarely or never watched after being recorded.

Research on (semi-)automatic summarization and annotation of consumer videos is an emerging field within the multimedia community.
The trend in consumer video summarization techniques is to select clips in the video randomly [4] or based on a one-dimensional "importance" score predicted from audio-visual features [1][2]. Probabilistic scene segmentation and clustering based on audio-visual features has also been proposed for accessing consumer videos [5]. The limitation of these approaches is that they define a priori what is "important" or "similar" in consumer videos, independently of users. Several prior works propose taxonomies and annotation schemes for generic videos [3][6][7][8]; however, none of these approaches has been specifically tailored to, developed for, or evaluated on consumer videos with real users. In this paper, we set out to explore what is important in consumer videos from the users’ perspective.

This paper proposes a taxonomy of interesting concepts tailored to consumer videos based on a user study. The proposed taxonomy has three basic dimensions of interest (DOIs), which correspond to 1) the objects (main characters or entities), 2) the scenes (compositions or aggregations of objects) and 3) the events (actions, changes in objects and scenes, or happenings with special meaning). We conducted an extensive user study to evaluate the proposed taxonomy and, in particular, the three DOIs. Subjects were asked to score the importance of each dimension, and to annotate several video clips with free text and/or the taxonomy's concepts. The video clips were selected from a diverse set of real consumer videos. Analysis of the user study data reveals high consistency (70%) of the scores across different users, higher importance of objects and events, and independence between objects and events.

This paper also analyzes the influence of audio-visual features on DOI scores, and proposes effective prediction models based on simple heuristic rules and neural networks.
Our findings indicate that panning/tilting, a few large foreground objects or zooming-in, and audio features (music, applause and cheers) are good indicators of important scenes, objects, and events in consumer videos, respectively. Effective prediction of user interests in consumer videos can greatly advance annotation and summarization tools. For example, if only objects are important in a video clip, the annotation tool can focus on recognizing relevant objects (e.g., people). Consumer video summaries can now be edited following a meaningful and adaptable grammar. A summary can first introduce the main objects and later interleave important events and scenes.

2 Dimensions of Interests

After inspecting several hours of real consumer videos, we realized that people naturally pay attention to multiple aspects while watching consumer videos. In this first analysis, we concluded that objects, scenes and