CRAM: Compact Representation of Actions in Movies

Mikel Rodriguez
Computer Vision Lab, University of Central Florida, Orlando, FL
mikel@cs.ucf.edu

Abstract

Thousands of hours of video are recorded every second across the world. Because searching for a particular event of interest within hours of video is time consuming, most captured videos are never examined and are only used in a post-factum manner. In this work, we introduce activity-specific video summaries, which provide an effective means of browsing and indexing video based on a set of events of interest. Our method automatically generates a compact video representation of a long sequence, which features only activities of interest while preserving the general dynamics of the original video. Given a long input video sequence, we compute optical flow and represent the corresponding vector field in the Clifford Fourier domain. Dynamic regions within the flow field are identified within the phase spectrum volume of the flow field. We then compute the likelihood that certain activities of relevance occur within the video by correlating it with spatio-temporal maximum average correlation height (MACH) filters. Finally, the input sequence is condensed via a temporal shift optimization, resulting in a short video clip which simultaneously displays multiple instances of each relevant activity.

1. Introduction

Every day, millions of hours of video are captured around the world by CCTV cameras, webcams, and traffic cameras. In the United States alone, an estimated 26 million video cameras produce more than four billion hours of video footage every week. In the time it takes to read this sentence, close to 20,000 hours of video have been captured and saved at different locations in the U.S. However, the vast majority of this wealth of data is never analyzed by humans. Instead, most of the video is used in an archival, post-factum manner once an event of interest has occurred.

The main reason for this lack of exploitation is that video browsing and retrieval are inconvenient due to inherent spatio-temporal redundancies, in which extended periods of time contain little to no activity or events of interest. In most videos, a specific activity of interest may occupy only a relatively small region of the video's entire spatio-temporal extent.

There exists a large body of work on activity recognition, which focuses mainly on detection in short, pre-segmented video clips commonly found in publicly available, standard action datasets. In this work, we attempt to move beyond action detection and provide a means of generating a compact video representation based on a set of activities of interest, while preserving the scene dynamics of the original video. In our approach, a user specifies which activities are of interest, and the video is automatically condensed to a short clip which captures the most relevant events based on the user's preference. We follow the output summary format of non-chronological video synopsis approaches, in which events which occur at different times may be displayed concurrently, even though they never occur simultaneously in the original video. However, instead of assuming that all moving objects are interesting, priority is given to the specific activities of interest that pertain to a user's query. This provides an efficient means of browsing through large collections of video for events of interest.
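To make the pipeline summarized above concrete, the sketch below illustrates its first two stages under simplifying assumptions: dense optical flow is computed per frame pair with OpenCV's Farneback method, the 2D flow field is embedded as a complex signal u + iv (a common simplification standing in for the full Clifford Fourier representation of a vector field), and dynamic regions are localized by back-projecting the phase-only spectrum. All function names, Farneback parameters, and thresholds are illustrative assumptions, not the paper's implementation; the MACH-filter correlation and temporal shift optimization described in the abstract would operate on the resulting dynamic-region volume.

```python
# Minimal sketch of dynamic-region detection from optical flow phase spectra.
# Assumptions: complex embedding u + i*v in place of the Clifford Fourier
# transform, Farneback flow parameters, and a percentile threshold.
import cv2
import numpy as np

def dynamic_region_mask(prev_gray, next_gray, percentile=95):
    """Return a binary mask of dynamic regions for one frame pair."""
    # Dense optical flow (Farneback); flow[..., 0] = u, flow[..., 1] = v.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Embed the flow field as a complex signal and take its 2D FFT.
    f = np.fft.fft2(flow[..., 0] + 1j * flow[..., 1])
    # Keep only the phase spectrum (unit-magnitude coefficients).
    phase_only = np.exp(1j * np.angle(f))
    # Back-project the phase spectrum; strong responses mark dynamic regions.
    response = np.abs(np.fft.ifft2(phase_only)) ** 2
    response = cv2.GaussianBlur(response.astype(np.float32), (9, 9), 2.0)
    return response > np.percentile(response, percentile)

def dynamic_volume(frames):
    """Stack per-frame masks into a spatio-temporal volume of dynamic regions."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    return np.stack([dynamic_region_mask(a, b)
                     for a, b in zip(grays[:-1], grays[1:])], axis=0)
```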
2. Related Work

Action recognition and event classification in video have been studied extensively in recent years; a comprehensive review can be found in surveys on the topic [9, 1]. Most of the existing work can be broadly categorized into approaches based on tracking [18, 4], interest points [11], geometrical models of human body parts [6], 3D information [13], volumetric space-time shapes [21], action clustering [10], and temporal templates [2].

A common theme in all of these approaches is their focus on detection. That is, given a learned model of an action class, emphasis is placed on detecting instances of the learned action within the small testing clips typically found in standard action datasets. After performing detection, most methods do not go beyond placing a bounding box delimiting the spatio-temporal extent of the detected action. Our present work aims at moving beyond detection by examining the role of action recognition in efficient video representation.