Panoptic Segmentation: A Review

Omar Elharrouss a, Somaya Al-Maadeed a, Nandhini Subramanian a, Najmath Ottakath a, Noor Almaadeed a, and Yassine Himeur b

a Department of Computer Science and Engineering, Qatar University, Doha, Qatar
b Department of Electrical Engineering, Qatar University, Doha, Qatar

ARTICLE INFO

Keywords: Panoptic segmentation; Semantic segmentation; Instance segmentation; Artificial intelligence; Image segmentation; Convolutional neural networks

ABSTRACT

Image segmentation for video analysis plays an essential role in research fields such as smart cities, healthcare, computer vision, geoscience, and remote sensing. A significant effort has recently been devoted to developing novel segmentation strategies, and one of the latest outstanding achievements is panoptic segmentation, which results from the fusion of semantic and instance segmentation. Panoptic segmentation is currently being studied as a way to gain a more nuanced understanding of image scenes for video surveillance, crowd counting, autonomous driving, medical image analysis, and scene understanding in general. To that end, this paper presents, to the best of the authors' knowledge, the first comprehensive review of existing panoptic segmentation methods. A well-defined taxonomy of existing panoptic techniques is constructed based on the nature of the adopted algorithms, the application scenarios, and the primary objectives. The use of panoptic segmentation for annotating new datasets by pseudo-labeling is also discussed. Ablation studies are then examined to understand panoptic methods from different perspectives. Moreover, evaluation metrics suitable for panoptic segmentation are discussed, and the performance of existing solutions is compared to characterize the state of the art and identify its limitations and strengths.
Lastly, the current challenges faced by this technology and the future trends attracting considerable interest are elaborated, which can serve as a starting point for upcoming research. The papers provided with code are available at: https://github.com/elharroussomar/Awesome-Panoptic-Segmentation

∗ Corresponding author. ∗∗ Principal corresponding author.
E-mail addresses: elharrouss.omar@gmail.com (O. Elharrouss); s_alali@qu.edu.qa (S. Al-Maadeed); nandhini.reborn@gmail.com (N. Subramanian); gonajmago@gmail.com (N. Ottakath); n.alali@qu.edu.qa (N. Almaadeed); yassine.himeur@qu.edu.qa (Y. Himeur)

1. Introduction

Nowadays, cameras, radars, Light Detection and Ranging (LiDAR), and other data-capturing sensors are highly prevalent [1]. They are deployed in smart cities to collect data from multiple sources in real time and to raise alerts as soon as incidents occur [2, 3]. They are also installed in public and residential buildings for security purposes. The resulting surge in devices with video-capturing capabilities creates opportunities for analysis and inference through computer vision [4, 5]. Demand in this field has shot up because of the massive amounts of data generated by such equipment and the Artificial Intelligence (AI) tools that have revolutionized computing, notably machine learning (ML) and deep learning (DL), especially convolutional neural networks (CNNs). Captured videos and images contain useful information for many smart city applications, such as public security through video surveillance [6, 7], motion tracking [8], pedestrian behavior analysis [9, 10], healthcare services and medical video analysis [11, 12], and autonomous driving [13, 14]. Current needs and research trends in this field encourage further development, with ML and big data analytics playing an essential role. Computer vision tasks such as object detection, recognition, and classification rely on feature extraction, labeling, and segmentation of captured videos or images, predominantly in real time [8, 15, 16]. There is a strong need for properly labeled data in the AI learning process, where information can be extracted from images for multiple inferences. How images are labeled strongly depends on the target application. Bounding-box labeling and image segmentation are two common ways to label videos/images, which makes automatic labeling a subject of interest [17, 18].

Computer vision techniques have increased the robustness and efficiency of many technologies in our lives by enabling systems to detect, classify, recognize, and segment the content of any scene captured by cameras. Segmentation partitions a scene into homogeneous regions with similar texture or structure: countable entities or objects, termed things, and uncountable regions such as sky and roads, termed stuff [19, 20]. The content of the monitored scene is thus categorized into things and stuff, and many visual algorithms are dedicated to identifying "stuff" and "things" and to drawing a clear division between them. To that end, semantic segmentation was introduced to identify this pair (things, stuff) [21]. In contrast, instance segmentation processes only the things in the image/video, where each object is detected and isolated with a bounding box or a segmentation mask [22, 23, 24].

Generally, image segmentation is the process of labeling image content [25, 26]. Instance segmentation and semantic segmentation are traditional approaches to current

Elharrouss et al.: Preprint submitted to Elsevier Page 1 of 29
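To make the things/stuff distinction concrete, the following minimal sketch shows how a single panoptic label map can carry both semantic classes (for stuff) and per-object instance identities (for things). It uses the widely adopted convention of encoding each pixel as semantic_id * label_divisor + instance_id (as in Cityscapes-style tooling); the toy class names, label values, and divisor here are illustrative assumptions, not the format of any specific method surveyed in this paper.

```python
import numpy as np

LABEL_DIVISOR = 1000  # convention: panoptic_id = semantic_id * 1000 + instance_id

# Toy 2x4 panoptic label map (illustrative values only):
# semantic class 0 = "sky" (stuff), 1 = "road" (stuff), 2 = "car" (thing)
panoptic = np.array([
    [0,    0,    2001, 2001],   # sky,  sky,  car #1, car #1
    [1000, 1000, 2001, 2002],   # road, road, car #1, car #2
])

semantic = panoptic // LABEL_DIVISOR   # per-pixel semantic class (stuff and things)
instance = panoptic % LABEL_DIVISOR    # per-pixel instance id (0 for stuff regions)

thing_classes = {2}  # classes treated as countable "things"

# Count the distinct instances of each thing class
for cls in thing_classes:
    ids = np.unique(panoptic[(semantic == cls) & (instance > 0)])
    print(f"class {cls}: {len(ids)} instance(s)")  # -> class 2: 2 instance(s)
```

The design point is that a single integer per pixel suffices: dividing recovers the semantic map (what semantic segmentation produces), while the remainder separates individual objects (what instance segmentation produces), which is exactly the pair of outputs panoptic segmentation unifies.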