TRAJECTORY BASED VIDEO OBJECT MANIPULATION

Rajvi Shah and P J Narayanan
CVIT, IIIT Hyderabad, India
rajvi.shah@research.iiit.ac.in, pjn@iiit.ac.in

ABSTRACT

We propose an object centric representation for easy and intuitive navigation and manipulation of videos. Object centric representation allows a user to directly access and process objects as basic video components. We demonstrate a trajectory based interface and example operations that allow users to retime, reorder, remove or clone video objects in a ‘click and drag’ fashion. This interface is created by extracting object motion information from the video. We use object detection and tracking to obtain spatiotemporal video object tubes. The corresponding object motion trajectories are represented in a 3D (x, y, t) grid. Users can navigate and manipulate video objects by scrubbing or manipulating the corresponding trajectories. We show some example applications of the proposed interface such as object synchronization, saliency magnification, visual effects and composite video creation.

Index Terms— Motion based Video Representation, Interactive Video Composition, Object based Video Access

1. INTRODUCTION

The proliferation of digital cameras has caused a tremendous increase in user created images and videos. Manipulating captured images has become a home user’s task due to the availability of numerous easy to use photo editing tools. In comparison, video manipulation is still less common. Basic video editing platforms are easy to use, but these tools provide limited functionality such as splitting and merging videos, or adding captions and audio. Professional video editing platforms are rich in functionality, but demand high technical expertise. Moreover, most of these tools model and represent videos as a collection of frames stacked against a timeline.
Though this frame-time model is best suited for passive playback and media synchronization, it makes object centric manipulation of videos a laborious task. A naïve user usually gets discouraged by complex software controls and cumbersome processing.

The motivation of our work is to use computer vision techniques to improve the usability of video manipulation interfaces. For a common user, it is more convenient to think of objects or activities, rather than frames, as the basic video entities. We propose an object centric representation for easy and intuitive navigation and temporal manipulation of video objects. We model the video as a collection of spatiotemporal object volumes (object tubes) placed in a 3D grid, as depicted in Fig. 1. With the advancement of computer vision techniques for object detection and tracking, creating such a representation with little or no human intervention is now possible.

Fig. 1. Object tube model of a video

We extract motion information from the video as explained in Sec. 3 and represent object trajectories in a 3D interaction grid. Users can scrub, move or modify these trajectories to manipulate video objects interactively. Motion based video representations are used in other video navigation [4, 6, 7] and annotation [6] systems. However, the focus of these systems is on providing an in-scene Direct Manipulation interface, not on video content manipulation. Object motion information is also used in [11] to produce synopsis videos; that system combines motion information with spatiotemporal optimization constraints for automation. We, on the contrary, make object motion information available for user interaction. This allows the user to interactively produce multiple composite videos by modifying object trajectories in different ways.

The proposed representation allows interaction and manipulation at the object level. This representation typically works for long shot videos.
Such video shots are captured from sufficient distance to place the entire object and its activity in relation to the background; examples include surveillance videos, art performance videos and sports videos. In later sections, we discuss a prototype interface and associated operations. We show that using these simple ‘click and drag’ operations a user can navigate, retime, reorder, remove or clone video objects. We demonstrate a few potential applications with example scenarios and conclude with a discussion of future scope.
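As a concrete illustration, the object tube representation and the trajectory-level operations described above can be sketched as follows. This is a minimal Python sketch under our own assumptions: the class and method names (`ObjectTube`, `retime`, `clone`) are hypothetical and not taken from the paper, and an actual system might store segmentation masks or tracker state rather than plain bounding boxes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) in pixels

@dataclass
class ObjectTube:
    """Spatiotemporal object tube: one tracked bounding box per frame.

    Hypothetical structure for illustration only; the paper's actual
    implementation may differ.
    """
    object_id: int
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index t -> box

    def add_detection(self, t: int, box: Box) -> None:
        """Record the tracker's output for frame t."""
        self.boxes[t] = box

    def trajectory(self) -> List[Tuple[float, float, int]]:
        """Centroid trajectory as (x, y, t) points for the 3D interaction grid."""
        return [
            (x + w / 2.0, y + h / 2.0, t)
            for t in sorted(self.boxes)
            for (x, y, w, h) in [self.boxes[t]]
        ]

    def retime(self, offset: int, stretch: float = 1.0) -> "ObjectTube":
        """Drag the trajectory along the t axis: delay and/or slow the object."""
        remapped = {
            int(round(t * stretch)) + offset: box
            for t, box in self.boxes.items()
        }
        return ObjectTube(self.object_id, remapped)

    def clone(self, new_id: int, offset: int) -> "ObjectTube":
        """Duplicate the object so a copy appears offset frames later."""
        copy = self.retime(offset)
        copy.object_id = new_id
        return copy

# A tube tracked over three frames, moving slowly to the right:
tube = ObjectTube(object_id=1)
tube.add_detection(0, (10, 20, 40, 60))
tube.add_detection(1, (14, 20, 40, 60))
tube.add_detection(2, (18, 22, 40, 60))
print(tube.trajectory())     # [(30.0, 50.0, 0), (34.0, 50.0, 1), (38.0, 52.0, 2)]

# Retiming: play the object at half speed, starting 5 frames later.
slowed = tube.retime(offset=5, stretch=2.0)
print(sorted(slowed.boxes))  # [5, 7, 9]
```

Note that a non-integer effective frame rate (here, `stretch=2.0`) leaves gaps between kept frames (6 and 8 above); a real interface would interpolate boxes for the missing frames before compositing.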