Multimedia Tools and Applications

Audio-Visual Events for Multi-Camera Synchronization

Anna Llagostera Casanovas · Andrea Cavallaro

Abstract We present a multimodal method for the automatic synchronization of audio-visual recordings captured with a set of independent cameras. The proposed method jointly processes data from the audio and video channels to estimate inter-camera delays, which are used to temporally align the recordings. Our approach consists of three main steps. First, we extract temporally sharp audio-visual events from each recording. These events are short and are characterized by an audio onset that coincides with a well-localized spatio-temporal change in the video data. Then, we estimate the inter-camera delays by assessing the co-occurrence of the events across the recordings. Finally, we use a cross-validation procedure that combines the results for all camera pairs and aligns the recordings on a global timeline. An important feature of the proposed method is its estimation of a confidence level for the results, which allows us to automatically reject recordings that are not reliable for the alignment. Results show that our method outperforms state-of-the-art approaches based on audio-only or video-only analysis, with both fixed and hand-held moving cameras.

Keywords Audio-visual processing · multiple cameras · synchronization · event detection

A. Llagostera Casanovas contributed to this work while at Queen Mary University of London, UK. She was supported by the Swiss National Science Foundation under the prospective researcher fellowship PBELP2-137724. A. Cavallaro acknowledges the support of the UK Engineering and Physical Sciences Research Council (EPSRC) under grant EP/K007491/1.

Anna Llagostera Casanovas
SwissQual AG, Switzerland
E-mail: anna.llagostera@swissqual.com

Andrea Cavallaro
Centre for Intelligent Sensing, Queen Mary University of London, UK
E-mail: andrea.cavallaro@eecs.qmul.ac.uk
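The core idea of the second step, estimating a pairwise inter-camera delay by assessing the co-occurrence of detected events, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the grid-search parameters, and the matching tolerance are all assumptions made for illustration. Given event timestamps extracted independently from two recordings, it scans candidate delays and keeps the one that matches the most events, also returning a simple confidence score (the fraction of matched events), in the spirit of the reliability check described in the abstract.

```python
# Hypothetical sketch of delay estimation by event co-occurrence.
# Not the paper's actual algorithm; parameters are illustrative assumptions.
import numpy as np

def estimate_delay(events_a, events_b, max_delay=10.0, step=0.01, tol=0.05):
    """Return (delay, confidence): `delay` shifts camera B onto camera A's timeline.

    events_a, events_b: sorted 1-D sequences of event times in seconds.
    confidence: fraction of B's events matched at the best delay.
    """
    events_a = np.asarray(events_a, dtype=float)
    events_b = np.asarray(events_b, dtype=float)
    candidates = np.arange(-max_delay, max_delay + step, step)
    best_delay, best_matches = 0.0, -1
    for d in candidates:
        shifted = events_b + d
        # For each shifted event in B, find its nearest neighbour in A.
        idx = np.searchsorted(events_a, shifted)
        idx = np.clip(idx, 1, len(events_a) - 1)
        nearest = np.minimum(np.abs(events_a[idx] - shifted),
                             np.abs(events_a[idx - 1] - shifted))
        # Count events that land within the matching tolerance.
        matches = int(np.sum(nearest <= tol))
        if matches > best_matches:
            best_matches, best_delay = matches, d
    confidence = best_matches / max(len(events_b), 1)
    return best_delay, confidence
```

A cross-validation over all camera pairs, as the abstract describes, would then combine such pairwise delays into one global timeline and discard pairs whose confidence is low.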