Deep Multi-View Correspondence for Identity-Aware Multi-Target Tracking

Adnan Hanif
Air University, Islamabad, Pakistan
adnanengineer@gmail.com

Atif bin Mansoor
University of Western Australia, Perth, Australia
atif.mansoor@uwa.edu.au

Ali Shariq Imran
Norwegian University of Science and Technology, Gjøvik, Norway
ali.imran@ntnu.no

Abstract—A multi-view multi-target correspondence framework employing deep learning on overlapping cameras is proposed for identity-aware tracking in the presence of occlusion. Our complete deep-learning-based pipeline of detection, multi-view correspondence, fusion and tracking greatly improves person correspondence across multiple wide-angle views over traditionally used feature sets and handcrafted descriptors. We transfer the learning of a deep convolutional neural network (CNN), trained to jointly learn pedestrian features and similarity measures, to establish identity correspondence of non-occluding targets across multiple overlapping cameras with varying illumination and human pose. Subsequently, the identity-aware foreground principal axes of visible targets in each view are fused onto a top view without requiring camera calibration or precise principal-axis length information. The problem of ground point localisation of targets on the top view is then solved via linear programming, which computes an optimal assignment of projected-axes intersection points to targets using identity information from the individual views. Finally, our proposed scheme is evaluated under the tracking performance measures MOTA and MOTP on benchmark video sequences, demonstrating high accuracy compared to other well-known approaches.

Keywords—Multi-view target tracking; deep CNN; Aggregate Channel Features; principal axes; assignment problem.

I. INTRODUCTION

In recent years, the need for robust visual tracking of people has grown with the overwhelming demand for intelligent video surveillance (IVS) at sensitive areas such as airports, train stations, parking lots and shopping malls. Single-camera person tracking is a computer vision problem with a long research history, in which tracking is essentially a feature matching problem from one frame to the next. To disambiguate multiple humans in close proximity, invariant constraints such as color features, appearance cues, motion uniformity constraints and constant-velocity assumptions are used. Much of the research in single-camera tracking has focused on accurate detection and tracking of multiple humans in both indoor and outdoor surveillance environments with varying illumination and crowd density. However, single-camera tracking of multiple humans in cluttered and crowded scenes is a challenging task, primarily due to the limited field-of-view (FOV) and the occlusion among people as seen from the perspective of a single camera.

During the past decade, the rapid proliferation of cost-effective video cameras has resulted in a paradigm shift in the IVS of crowded areas: tracking and maintaining the identity of multiple humans across different cameras with overlapping FOVs. The use of multiple cameras at different viewpoints handles occlusion in cluttered scenes by establishing multi-view correspondence across cameras, so that target information is obtained in a fused manner. The problem of modelling correspondence across multiple views can be solved provided a camera-scene model is available. However, in real-world scenarios where camera calibration information is unavailable or inaccurate, finding correspondence is difficult due to the lack of distinctive target features and the variations in human pose and illumination across wide-angle camera viewpoints.
Therefore, multi-camera multi-target tracking has become a multidisciplinary research problem spanning computer vision, information fusion, pattern recognition and artificial intelligence.

Most recently, deep learning has been applied to various computer vision tasks such as image classification [1], object detection [2], segmentation [3] and pose estimation [4]. Its computational models, such as convolutional neural networks (CNNs), are composed of multiple processing layers that learn representations of data with multiple levels of abstraction [5]. CNNs have gained much prominence over traditional handcrafted person descriptors, such as ensembles of local features [6], and over traditional classifiers such as SVMs [7], for their ability to jointly learn representations and metrics [8]. One of the major applications of these deep learning models is person re-identification (Re-ID) across different camera views, which has been explored in [9], [10], mostly with training on large pedestrian datasets such as VIPeR [11], CUHK01 [12], CUHK03 [9] and Market1501 [13].

In this paper, we present a complete pipeline of detection, multi-view correspondence, fusion and tracking for multi-camera multi-target tracking. Our main contribution is twofold. (1) We employ a deep transfer learning scheme for person identification across multiple overlapping camera FOVs to establish multi-view target correspondence. (2) We apply a linear optimization technique to localize target ground points under a unique target-identity-to-intersection-point association, using top-view projections of the identity-aware principal axes. The initial foreground information, on non-occluding targets only, is first gathered using the proven Aggregate Channel Features
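Contribution (2) above amounts to a linear assignment between target identities and the intersection points of the projected principal axes on the top view. The following is a minimal sketch, not the paper's implementation: it assumes a Euclidean-distance cost between each identity's predicted top-view position and each candidate intersection point (the function name `assign_identities` and both array arguments are hypothetical), and solves the assignment with SciPy's Hungarian solver, which returns the optimal solution of the underlying integer linear program.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_identities(intersections, predicted_positions):
    """Match each target identity to one axis-intersection point.

    intersections:       (m, 2) candidate intersection points on the top view
    predicted_positions: (n, 2) predicted ground points, one per identity
    Returns a dict {identity_index: intersection_index}.
    """
    # Cost matrix: Euclidean distance for every (identity, intersection) pair.
    cost = np.linalg.norm(
        predicted_positions[:, None, :] - intersections[None, :, :], axis=2)
    # Hungarian method: minimum-cost one-to-one assignment.
    ident_idx, inter_idx = linear_sum_assignment(cost)
    return dict(zip(ident_idx.tolist(), inter_idx.tolist()))


# Hypothetical example: two identities, two intersection points.
pts = np.array([[1.0, 1.0], [5.0, 5.0]])
preds = np.array([[4.8, 5.1], [0.9, 1.2]])  # identity 0 lies near (5, 5)
print(assign_identities(pts, preds))  # {0: 1, 1: 0}
```

In practice the cost could instead be built from the identity information carried over from the individual views, but the one-to-one assignment structure is the same.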