Synthesizing the Unseen for Zero-Shot Object Detection

Nasir Hayat 1(B), Munawar Hayat 1,2, Shafin Rahman 3, Salman Khan 1,2, Syed Waqas Zamir 1, and Fahad Shahbaz Khan 1,2

1 Inception Institute of Artificial Intelligence, Abu Dhabi, UAE
nh2218@nyu.edu
2 MBZ University of AI, Abu Dhabi, UAE
{munawar.hayat,salman.khan,fahad.khan}@mbzuai.ac.ae
3 North South University, Dhaka, Bangladesh

Abstract. Existing zero-shot detection approaches project visual features to the semantic domain for seen objects, hoping to map unseen objects to their corresponding semantics during inference. However, since unseen objects are never observed during training, the detection model is skewed towards seen content, labeling unseen objects as background or as a seen class. In this work, we propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain. The major challenge then becomes: how can unseen objects be accurately synthesized using only their class semantics? Towards this ambitious goal, we propose a novel generative model that uses class semantics not only to generate the features but also to discriminatively separate them. Further, using a unified model, we ensure the synthesized features have high diversity, representing intra-class differences and the variable localization precision of detected bounding boxes. We test our approach on three object detection benchmarks, PASCAL VOC, MSCOCO, and ILSVRC detection, under both conventional and generalized settings, showing impressive gains over state-of-the-art methods. Our code is available at https://github.com/nasir6/zero_shot_detection.

Keywords: Zero-shot object detection · Generative adversarial learning · Visual-semantic relationships

1 Introduction

Object detection is a challenging problem that seeks to simultaneously localize and classify object instances in an image [1].
Traditional object detection methods work in a supervised setting where a large amount of annotated data is used to train models. Annotating object bounding boxes for training such models is a labor-intensive and expensive process. Further, for many rarely occurring objects, we might not have any training examples. Humans, on the other hand, can easily identify unseen objects solely based upon the objects' attributes

© Springer Nature Switzerland AG 2021
H. Ishikawa et al. (Eds.): ACCV 2020, LNCS 12624, pp. 155–170, 2021.
https://doi.org/10.1007/978-3-030-69535-4_10
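The feature-synthesis idea from the abstract — conditioning a generator on class semantics so that visual features can be produced even for classes with no training images — can be sketched as follows. This is a minimal toy sketch, not the authors' implementation: the single-layer generator, the dimensions, and the random stand-ins for word-vector semantics are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: semantic embedding, noise, and visual feature.
SEM_DIM, NOISE_DIM, FEAT_DIM = 300, 64, 1024

# Toy generator: one linear layer with ReLU, mapping the concatenation
# [noise; class semantics] to a synthesized visual feature.
W = rng.normal(scale=0.02, size=(NOISE_DIM + SEM_DIM, FEAT_DIM))

def synthesize_features(class_semantics, n_per_class):
    """Synthesize n_per_class visual features per class from its semantics.

    Sampling fresh noise per feature is what yields intra-class diversity;
    conditioning on the semantic vector ties each feature to its class.
    """
    feats, labels = [], []
    for label, sem in enumerate(class_semantics):
        z = rng.normal(size=(n_per_class, NOISE_DIM))
        cond = np.concatenate([z, np.tile(sem, (n_per_class, 1))], axis=1)
        feats.append(np.maximum(cond @ W, 0.0))  # ReLU non-linearity
        labels.append(np.full(n_per_class, label))
    return np.vstack(feats), np.concatenate(labels)

# Stand-in semantic vectors for two unseen classes (random here; in practice
# these would be word embeddings of the class names).
unseen_semantics = rng.normal(size=(2, SEM_DIM))
feats, labels = synthesize_features(unseen_semantics, n_per_class=50)
print(feats.shape, labels.shape)  # (100, 1024) (100,)
```

The synthesized (feature, label) pairs for unseen classes can then be mixed with real seen-class features to train an ordinary classifier in the visual domain, which is the key shift away from projecting visual features into the semantic space.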