Preliminary experiments in multi-view video stitching

Siniša Šegvić*, Marko Ševrović**, Goran Kos***, Vladimir Stanisavljević**** and Ivan Dadić**

* University of Zagreb, Faculty of Electrical Engineering and Computing, Croatia
** University of Zagreb, Faculty of Transportation Sciences, Croatia
*** Institute of Traffic and Communications, Zagreb, Croatia
**** Tilda d.o.o, Zagreb, Croatia
E-mail: sinisa.segvic@fer.hr

Abstract: We address the problem of stitching together the three videos acquired by a special rig consisting of three high-resolution cameras. The three cameras are placed in the horizontal plane on top of the service vehicle so that the fields of view of the lateral cameras overlap with the field of view of the middle camera. In the presented approach, the transformations between the common parts of the corresponding video frames are approximated by planar projective mappings. The required mappings are estimated by aligning the common parts of the three views in corresponding video frames. The experiments have been performed on production EuroRAP videos provided by our industrial partner. The obtained results confirm that the presented approach would simplify the existing road inspection procedures relying on the recorded multi-view video.

I. INTRODUCTION

Research on traffic accidents in many countries clearly shows that there is a pressing need to increase safety in road traffic. Even in developed countries with well-designed and well-maintained traffic infrastructure, adequate traffic education, and strict law enforcement, the rates of serious road traffic injuries remain unacceptably high [1]. Several campaigns around the world promote conservative premises, such as that on average 1 out of every 500 human reactions is plainly wrong [2]. These campaigns advocate high standards in road infrastructure construction, which would provide enough protection to avoid fatal consequences.
One such campaign is being carried out through the international programme EuroRAP [3], [4]. Within the EuroRAP programme, road safety is assessed [5], [6] by analysing video footage acquired simultaneously by three high-resolution cameras. The cameras are placed in the horizontal plane on top of the service vehicle so that the fields of view of the lateral cameras overlap with the field of view of the middle camera. A typical image triple acquired by such a platform is shown in Figure 1. The three acquired videos are evaluated by certified experts in traffic safety, who estimate various risk factors for each section of the assessed road. The evaluation process results in a risk map for a given road network, which makes it possible to quantify and compare the safety of the road sections along the considered route [3].

This research has been supported by the Faculty of Transportation Sciences, University of Zagreb.

Unfortunately, it has been found that the experts who perform the road traffic inspection find it quite difficult to follow three different videos at the same time. Thus, there is a need for a user-friendly arrangement of the three disjoint views of the road scene.

In this paper we take advantage of the fact that the focal points of the three cameras are quite close to each other when compared to the typical distances to the imaged scene. Consequently, the transformation between the common parts of the corresponding video frames is approximated by a planar projective mapping, or homography [7]. The desired comprehensive view of the scene is constructed by projecting pixels from the lateral cameras onto the image plane of the middle camera (in the computer vision literature this procedure is known as image stitching [8], [9]). The final result approximates an image which would be acquired by a middle camera equipped with a wide-angle lens.

The paper is organized as follows.
Image stitching is briefly reviewed in Section II. Section III presents some details about the EuroRAP programme (including the geometry of the image acquisition rig). The employed lower-level computer vision techniques are detailed in Section IV. The obtained experimental results are presented in Section V, while Section VI provides a conclusion and some directions for future work.

II. IMAGE STITCHING

The purpose of image stitching, or image compositing, is to process multiple images of the same scene in order to create a high-resolution photo-mosaic in which the seams are as smooth as possible. Today, these techniques are routinely used to produce digital maps and satellite photos. They are also embedded in many digital cameras in order to enable shooting ultra wide-angle panoramas with a conventional inexpensive lens having a horizontal field of view of less than 45°. However, despite the maturity of the lower-level building blocks, image stitching can still be a challenging task, as will also be demonstrated in the rest of this paper.

Image stitching typically consists of the following two tasks: i) image alignment and ii) determining composite pixels. Image alignment tells us which pixels from the original images map to a given pixel in the composite image. There are two main approaches to image alignment: feature-based and direct, or pixel-based [9]. Here we