Scene Intensity Estimation and Ranking for Movie Scenes Through Direct Content Analysis

Saurabh Kataria (12807637)
Department of Electrical Engineering
Indian Institute of Technology, Kanpur
Email: saurabhk@iitk.ac.in

Abhay Kumar (12011)
Department of Electrical Engineering
Indian Institute of Technology, Kanpur
Email: abhayk@iitk.ac.in

Abstract—In this project, an approach for scene intensity estimation and, subsequently, ranking of movie scenes based on extracted features such as scene length, harmonicity, and motion energy is implemented and evaluated. Such a ranking can be used for automatic trailer generation, movie summarization, and characterizing the emotional content of multimedia. The ranking can also be used to learn the intrinsic parameters of a user by processing the ratings he or she has provided for multimedia content on websites like Netflix; this in turn allows content to be indexed and retrieved according to a given user's profile. A dataset of 3 movies was constructed, in which each movie was broken into "scenes" manually and the top 10 critical scenes were marked. Incorporating facial emotion response is also explored. Results show that a simple combination of audio-visual features, used individually or together, can fairly reliably predict the intensity of a scene. Validation is performed by comparing performance against the manually annotated ground truth.

I. Introduction

Scene intensity estimation is a sub-task in automated multimedia content analysis and is closely related to emotion prediction in movies. As the name suggests, it aims at estimating the intensity of scenes/shots in a given piece of multimedia content (for example, a commercial movie). Since scene intensity is a subjective quantity, it is quantified through various low-level multimodal features. One particular application in the field of computational social science and media informatics is estimating gender representation in a movie.

A. Motivation

Can we predict how intense a scene in a movie is? Can we improve the existing models for that task? While intensity can be understood as a measure of excitement or activity in a scene, several related questions are much harder to answer. (a) Can we devise a computational model that closely approximates the intensity humans actually feel after watching a scene? (b) What factors are required to estimate it? For example, emotionally charged and music-intensive scenes tend to increase perceived intensity. Building on psychological findings, several attempts have been made to predict how interesting a video is [1]–[3]. While these attempts have produced promising results on related tasks such as gender representation estimation, they leave several open questions.

B. State of the Art and Preliminary Work

The rapid growth of online video content has accelerated research in video content analysis based on multimodal features. Several affective content-based video scene extraction schemes have been studied to map low-level features of video data to high-level emotional events. Multiple media modalities, including audio and visual cues, are exploited in [4] for the detection of semantically meaningful scenes in feature films. Hidden Markov Model (HMM)-based video affective content and audio emotional event analysis has been explored in [5], [6]. Valence features together with emotion intensity are used for HMM-based emotion type identification in [7]. Detection of attention-invoking audiovisual segments is formulated in [7] on the basis of saliency models for the audio, visual, and textual information conveyed in a video stream. Analyzing multimedia content at an affective level reveals information that describes its emotional value or scene intensity. Computable video features, namely average shot length, color variance, motion content, and lighting key, are exploited in [8].
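To make two of the low-level features above concrete, the following pure-Python sketch illustrates them under simple assumptions: motion energy as the mean absolute pixel difference between consecutive grayscale frames, and average shot length derived from a list of cut positions. The function names, the list-of-lists frame representation, and the default frame rate are illustrative choices for this sketch, not definitions taken from the cited works.

```python
def motion_energy(frames):
    """Mean absolute pixel difference between consecutive frames.

    `frames` is a list of 2-D lists of grayscale intensities (0-255).
    Higher values indicate more motion within the shot.
    """
    if len(frames) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for row_p, row_c in zip(prev, curr):
            for p, c in zip(row_p, row_c):
                total += abs(c - p)
                count += 1
    return total / count


def average_shot_length(cut_frames, fps=24.0):
    """Average shot duration in seconds, given cut positions as frame indices."""
    lengths = [b - a for a, b in zip(cut_frames, cut_frames[1:])]
    return sum(lengths) / len(lengths) / fps


# Two 2x2 frames whose pixels all change by 10 -> motion energy of 10.0
print(motion_energy([[[0, 0], [0, 0]], [[10, 10], [10, 10]]]))
# Cuts at frames 0, 24, 72 at 24 fps -> shots of 1 s and 2 s, average 1.5 s
print(average_shot_length([0, 24, 72]))
```

In practice these quantities would be computed with a video-processing library (e.g. decoding frames with OpenCV), but the arithmetic is exactly this simple averaging.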
One of the most recent works [3] considers three factors: shot length, motion energy, and harmonicity. Significant exploitation of cinematic principles and video features is