Bringing Generalization to Deep Multi-View Detection

Jeet Vora 1, Swetanjal Dutta 1, Kanishk Jain 1, Shyamgopal Karthik 2, Vineet Gandhi 1
1 CVIT, IIIT Hyderabad  2 University of Tübingen
{jeet.vora, swetanjal.dutta, kanishk.j}@research.iiit.ac.in, shyamgopal.karthik@uni-tuebingen.de, vgandhi@iiit.ac.in

Abstract

Multi-view Detection (MVD) is highly effective for occlusion reasoning in crowded environments. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. The key novelty of our work is to formalize three critical forms of generalization and propose experiments to evaluate them: generalization with i) a varying number of cameras, ii) varying camera positions, and iii) to new scenes. We find that existing state-of-the-art models generalize poorly, overfitting to a single scene and camera configuration. To address these concerns: (a) we propose a novel Generalized MVD (GMVD) dataset, assimilating diverse scenes with changing daytime, camera configurations, and varying numbers of cameras, and (b) we discuss the properties essential for bringing generalization to MVD and propose a barebones model that incorporates them. We perform a comprehensive set of experiments on the WildTrack, MultiViewX, and GMVD datasets to motivate the necessity of evaluating the generalization abilities of MVD methods and to demonstrate the efficacy of the proposed approach. The code and the proposed dataset can be found at https://github.com/jeetv/GMVD

1. Introduction

"Essentially all models are wrong, but some are useful." — George E. P. Box

In this work, we pursue the problem of Multi-View Detection (MVD), a mainstream solution for dealing with occlusions, especially when detecting humans/pedestrians in crowded settings.
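To make the MVD setup concrete, the following is a minimal sketch (not the paper's model) of the standard fusion recipe in this line of work: per-view detections are projected onto the ground plane via a per-camera homography, and the resulting per-view maps are averaged into a single occupancy map. The function names and the grid/cell parameters are hypothetical; averaging over views is one simple way to keep the output well-defined under a varying number of cameras.

```python
import numpy as np

def project_to_ground(points_px, H):
    """Project pixel coordinates to the ground plane via a 3x3 homography H.

    points_px: (N, 2) array of (u, v) image coordinates (e.g. detected feet).
    Returns an (N, 2) array of ground-plane coordinates.
    """
    pts = np.hstack([points_px, np.ones((len(points_px), 1))])  # homogeneous coords
    ground = (H @ pts.T).T
    return ground[:, :2] / ground[:, 2:3]  # perspective divide

def occupancy_map(per_view_points, homographies, grid_shape, cell_size):
    """Fuse detections from a variable number of views into one occupancy map.

    Each view votes into its own grid; averaging the grids (rather than
    concatenating view features) keeps the result independent of the
    number of cameras contributing.
    """
    fused = np.zeros(grid_shape)
    for pts, H in zip(per_view_points, homographies):
        view_grid = np.zeros(grid_shape)
        cells = np.floor(project_to_ground(pts, H) / cell_size).astype(int)
        for x, y in cells:
            if 0 <= x < grid_shape[0] and 0 <= y < grid_shape[1]:
                view_grid[x, y] = 1.0  # binary vote per cell per view
        fused += view_grid
    return fused / max(len(homographies), 1)  # mean over views
```

In deep MVD methods the per-view inputs are feature maps rather than point detections and the fusion is learned, but the same invariance argument applies: an aggregation that averages (or max-pools) across views, instead of stacking them channel-wise, does not hard-code the camera count.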
The input to MVD methods is a set of images from multiple calibrated cameras observing the same area from different viewpoints with overlapping fields of view. The predicted output is an occupancy map [9] on the ground plane (bird's eye view). Solutions for MVD have evolved from classical methods [1, 3, 9], through hybrid approaches [17], to end-to-end trainable deep learning architectures [13]. Expectedly, the current landscape of MVD is dominated by end-to-end trainable deep learning methods [12, 13, 26]. We argue that by training and testing on homogeneous data, current deep MVD methods have overlooked critical fundamental concerns, and to render them useful, the focus should shift towards their generalization abilities.

Figure 1. Three forms of generalization required in MVD: (a) varying number of cameras, (b) different camera configurations, and (c) generalizing to new scenes.

Ideally, three forms of generalization are essential for the practical scalability and deployment of MVD methods, as illustrated in Fig. 1:

1. Varying number of cameras: the model should adapt to a varying number of cameras (a network trained on six camera views should work on a setup with five cameras).
2. Varying configuration: the model should not overfit to a specific camera configuration. The performance should remain similar even with altered camera positions, as long as they span the designated area.
3. Varying scenes: models trained on one scene should work on another (e.g., a model trained at a traffic signal should work on a setup inside a university).

Surprisingly, existing deep learning-based MVD methods are primarily trained and tested with the same camera configuration, on the same scene, using the same number of cameras. Even the environmental conditions (time, weather, etc.) are similar across the train and test splits. For instance, the

arXiv:2109.12227v3 [cs.CV] 1 Dec 2021