Quality Evaluation of Computational Models for Movie Summarization

A. Zlatintsi∗, P. Koutras∗, N. Efthymiou∗, P. Maragos∗, A. Potamianos∗, and K. Pastra†
∗School of Electr. & Comp. Engin., National Technical University of Athens, 15773, Athens, Greece
†Cognitive Systems Research Institute, Athens, Greece
Email: [nzlat,pkoutras,maragos]@cs.ntua.gr, [nefthymiou,potam]@central.ntua.gr, kpastra@csri.gr

Abstract—In this paper we present a movie summarization system and investigate what constitutes a high-quality movie summary in terms of user experience. We propose state-of-the-art audio, visual and text techniques for the detection of perceptually salient events in movies. The evaluation of such computational models is usually based on the similarity between the system-detected events and some ground-truth data. For this reason, we have developed the MovSum movie database, which includes sensory and semantic saliency annotation as well as cross-media relations, for objective evaluations. The automatically produced movie summaries were qualitatively assessed, in an extensive human evaluation, in terms of informativeness and enjoyability, achieving very high ratings of up to 80% and 90%, respectively, which verifies the appropriateness of the proposed methods.

I. INTRODUCTION

Summarization refers to generating a shorter version of a video that retains as much of the information required for context understanding as possible, without sacrificing much of the original informativeness and enjoyability. Automatic summaries can be generated either with key-frames, which correspond to the most important video frames and form a static storyboard, or with video skims that include the most descriptive and informative video segments. Movie data are multimodal, containing visual, audio and textual streams, and many computational models have been proposed to estimate their multimodal saliency [1], [2], [3].
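To make the video-skim idea concrete, the following is a minimal sketch (not the paper's actual algorithm) of how a skim could be assembled from a frame-level saliency curve: the curve is split into fixed-length segments, segments are ranked by mean saliency, and the top-ranked ones are kept until a target summary length is reached. The function name, segment length and skimming ratio are illustrative assumptions.

```python
import numpy as np

def select_skim(saliency, fps=25.0, seg_len_s=2.0, target_ratio=0.2):
    """Greedy video-skim selection (illustrative sketch): split a
    frame-level saliency curve into fixed-length segments, rank them
    by mean saliency, and keep the top ones up to the target ratio."""
    seg_len = int(seg_len_s * fps)                  # frames per segment
    n_segs = len(saliency) // seg_len
    segs = np.asarray(saliency[: n_segs * seg_len]).reshape(n_segs, seg_len)
    scores = segs.mean(axis=1)                      # one score per segment
    order = np.argsort(scores)[::-1]                # most salient first
    keep = order[: max(1, int(round(target_ratio * n_segs)))]
    return sorted(keep.tolist())                    # chronological order

# Toy saliency curve: frames 100-199 are the most salient.
sal = np.zeros(500)
sal[100:200] = 1.0
print(select_skim(sal))  # -> [2, 3]: the two segments covering frames 100-199
```

The chronological re-sorting at the end matters: segments are chosen by saliency, but a watchable skim must present them in their original temporal order.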
Besides their sensory cues, movies also contain semantic events, which are difficult to model using only bottom-up, data-driven techniques; it is therefore usually necessary to incorporate high-level information.

There are many qualities that a movie must possess in order to give the viewer a pleasurable experience. In exactly the same way, a movie summary, produced either by a human or automatically by a system, has to contain features that attract human attention, but also incorporate elements that support the development of the plot. The features to be included and the techniques used by such a system are closely related to user experience. Hence, a computational summarization system could indeed benefit from, and be further improved through, qualitative human evaluations of the automatically produced summaries. First, the developer needs to know what is conspicuous and attracts human attention, as well as to have some ground-truth data for quality testing of his/her methods. Likewise, at the final stage he/she has to evaluate the system against user responses and preferences in order to further improve it. Classical machine learning techniques can evidently assist such an evaluation, yet they cannot really account for the human factor. Human perspective is thus needed for the implementation of systems that take user preferences into consideration and produce "user-defined" summaries. In this paper, we present novel ways of integrating user experience into movie summarization.

This research work was supported by the project "COGNIMUSE", which is implemented under the "ARISTEIA" Action of the Operational Program "Education and Lifelong Learning" and is co-funded by the European Social Fund and Greek National Resources.
Specifically, we propose a computational system for movie summarization and introduce a movie database enriched with salient-event annotation at the sensory and semantic levels. The evaluation of the produced summaries is based both on a machine learning technique and on extensive qualitative user experience evaluations, which verify the appropriateness of the proposed methods and the quality of the summaries.

II. DATABASE DESCRIPTION

Event detection and summarization algorithms can be significantly improved when there is adequate data for training, adaptation and evaluation of their parameters. The evaluation of the developed computational models is usually based on the similarity or correlation between the system-detected observations and some ground-truth data (annotated reference event observations) provided by experienced/trained users. For this reason, we developed the MovSum (Movie Summarization) Database, which at this point is still under development and is part of an evolving multimodal, video-oriented database annotated with saliency, semantic events and cross-media relations. The database in its current state has been used for objective evaluation of the system-detected salient events.

A. MovSum Database Annotated With Salient Events

Data collection: The process of creating the dataset includes data collection, conversion of the data to a suitable format, and annotation. Specifically, the dataset consists of half-hour continuous segments from seven movies (three and a half hours in total), namely: "A Beautiful Mind" (BMI), "Chicago" (CHI), "Crash" (CRA), "The Departed" (DEP), "Gladiator" (GLA), "Lord of the Rings - The Return of the King" (LOR) and the animation movie "Finding Nemo" (FNE)¹.
Oscar-winning movies from

¹Title, production year and production company of the seven movies: A Beautiful Mind, 2001 (Universal & DreamWorks); Chicago, 2002 (Miramax); Crash, 2004 (Lions Gate); The Departed, 2006 (Warner Bros.); Gladiator, 2000 (Universal & DreamWorks); Lord of the Rings, 2003 (New Line); Finding Nemo, 2003 (Walt Disney Pictures, Pixar Animation Studios).
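The objective evaluation described in Section II, comparing system-detected salient events against annotated ground truth, could be sketched at the frame level as follows. This is an illustrative assumption about the comparison, not the paper's exact protocol: both the detector output and the annotation are treated as binary per-frame masks, and frame-level precision, recall and F1 are computed over their overlap.

```python
import numpy as np

def frame_f1(pred, gt):
    """Frame-level precision/recall/F1 between a binary mask of
    system-detected salient frames and a ground-truth annotation mask
    (illustrative sketch of a similarity-based objective evaluation)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()            # frames both marked salient
    prec = tp / max(pred.sum(), 1)                 # guard against empty masks
    rec = tp / max(gt.sum(), 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-9)
    return float(prec), float(rec), float(f1)

# Toy example: detector fires on frames 0-5, annotator marked frames 3-8.
pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
gt   = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
print(frame_f1(pred, gt))  # -> (0.5, 0.5, 0.5)
```

A correlation between continuous saliency curves (e.g. via `np.corrcoef`) would be the analogous comparison when the ground truth is graded rather than binary.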