Towards Automatic Video Structuring: Shot Segmentation and Video Synthesis

Salvador Elías Venegas-Andraca
Department of Engineering Science, University of Oxford, Oxford, OX1 3PJ
salvador.venegas-andraca@keb.ox.ac.uk

Abstract

Automatic Video Structuring (AVS) is a challenging new field. Because of its novelty, it is necessary to define research areas and objectives and, in our opinion, two plausible and necessary topics in this field are shot segmentation and shot synthesis. The objective of this paper is therefore to present the techniques developed by the author both to detect and identify the three commonest types of shot transition (hard cuts, fades and dissolves) and to produce synthetic representations of the shots obtained from an arbitrary video sequence. For shot segmentation, our techniques combine similarity and statistical measures with image registration techniques for maximising measurement values. These techniques have been evaluated on several real video sequences by comparing the results with the corresponding ground truth, and against a standard existing method developed by the MoCA Project. For shot synthesis, we present an optical flow-based approach for motion detection, together with a projective transformation-based technique to compute image mosaics. The method is tested with real examples, and both static and synoptic mosaics are shown as results.

1 Introduction

Video has become a very important source of information due to the human ability to process visual information in real time. In its raw form, video is a frame-based representation of the 3D world, and therefore activities such as searching for and retrieving elements from different scenes can be slow and cumbersome. Unfortunately, much of the information in a video is implicit, so its identification, classification and retrieval are not trivial tasks.
Furthermore, video data is usually highly redundant, and choosing the optimal way to present information is therefore not straightforward. It is thus clear that automatic detection, identification and classification methods for video sequences are required. An example of an alternative representation is the use of content-based key frames instead of frame-based representations of shots; an example of alternative browsing capabilities is object-, scenario- or human-based search instead of the classical time-based search.

The preceding paragraphs motivate the definition of the central problem of video structuring, which can be stated as follows: given a video sequence, automatically create a new content-based representation of the data which emphasises its geometric and dynamic components and provides methods for fast video search, hyper-linking and view synthesis.

Due to its complexity, it is wise to divide research in AVS into several topics. In our opinion, because of the very nature of video and the way human beings tend to classify frame-based visual information, the first two topics to be studied in AVS are shot segmentation and shot synthesis.

As for shot segmentation, it must be noted that any video film is composed of shots, i.e., sets of concatenated frames with a certain continuity among them. Since we humans tend to separate a video film into shots as a natural way of explaining the story and its links, we use the same criterion to take a first step towards AVS. We have developed a technique for shot segmentation, which is presented, along with results on real-life video sequences, in section 3.

A further step should include rapid access to visual information by classifying data into meaningful entities (shots, scenes, people) as well as summarising elements found throughout the video (for example, scenes). Our contribution in this area is a technique developed for shot synthesis, presented in section 4.
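To make the shot-segmentation idea concrete, the following is a minimal illustrative sketch, not the measures used in this paper: it detects hard cuts by comparing grey-level histograms of consecutive frames with histogram intersection, one common similarity measure, and flags a boundary when the similarity drops below a threshold. The frame representation (flat lists of 0-255 grey values), the bin count and the threshold value are all assumptions chosen for illustration.

```python
import random

def gray_histogram(frame, bins=16):
    """Normalised grey-level histogram of a frame (flat list of 0-255 values)."""
    hist = [0] * bins
    for v in frame:
        hist[v * bins // 256] += 1
    n = len(frame)
    return [c / n for c in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical grey-level distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def detect_hard_cuts(frames, threshold=0.5):
    """Report a hard cut at index i when the similarity between frames
    i-1 and i falls below the threshold (illustrative threshold value)."""
    hists = [gray_histogram(f) for f in frames]
    return [i + 1 for i in range(len(hists) - 1)
            if histogram_intersection(hists[i], hists[i + 1]) < threshold]

# Synthetic sequence: five dark frames followed by five bright frames,
# so the only hard cut lies between frames 4 and 5.
random.seed(0)
dark = [[random.randrange(0, 60) for _ in range(1000)] for _ in range(5)]
bright = [[random.randrange(180, 256) for _ in range(1000)] for _ in range(5)]
print(detect_hard_cuts(dark + bright))  # → [5]
```

Within a shot, consecutive histograms overlap heavily and the intersection stays near 1; across a hard cut the distributions are nearly disjoint and the intersection collapses, which is why a single global threshold suffices in this toy setting. Gradual transitions such as fades and dissolves spread the change over many frames and would not trip a per-pair threshold like this one.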