Bulletin of Electrical Engineering and Informatics
Vol. 13, No. 5, October 2024, pp. 3601~3608
ISSN: 2302-9285, DOI: 10.11591/eei.v13i5.7443
Journal homepage: http://beei.org

Enhanced building footprint extraction from satellite imagery using Mask R-CNN and PointRend

Ahmed NourEldeen 1, Mohamed E. Wahed 2
1 Department of Mathematics and Computer Science, Faculty of Science, Suez University, Suez, Egypt
2 Faculty of Computers and Informatics, Suez Canal University, Ismailia, Egypt

Article Info
Article history: Received Aug 19, 2023; Revised Mar 22, 2024; Accepted Mar 29, 2024

ABSTRACT
The extraction of building footprints from aerial photos and satellite imagery plays a crucial role in change detection, urban development, and the detection of encroachments on agricultural land. Deep neural networks can learn features directly from the data and provide accurate methods for detecting and extracting building footprints from satellite imagery. Image segmentation, the process of dividing an image into coherent parts, takes two main forms: semantic segmentation and instance segmentation. Convolutional neural networks (CNNs) are commonly used for both tasks. In this paper, we propose a hybrid approach to extracting building footprints from low-resolution satellite imagery using instance segmentation techniques. Our analysis demonstrates that the mask region-based CNN (Mask R-CNN) architecture with a ResNet-34 backbone and a PointRend head, which improves the bounding-box and mask predictions, achieves the highest performance, as evidenced by various metrics, including an average precision (AP) score of 83.39% and an F1 score of 85.71%. This approach holds promise for developing automated tools to process satellite imagery, benefiting fields such as agriculture, land use monitoring, and disaster response.
Keywords: Artificial intelligence; Building footprint extraction; Convolutional neural network; Deep learning; Satellite imagery

This is an open access article under the CC BY-SA license.

Corresponding Author:
Ahmed NourEldeen
Department of Mathematics and Computer Science, Faculty of Science, Suez University
Suez, Egypt
Email: ahmednour_cs@yahoo.com

1. INTRODUCTION
Building footprint extraction from satellite imagery is used in many geographic information systems (GIS) solutions, such as disaster assessment, geospatial analysis, regional planning, population growth estimation, and change detection. Deep learning models have become the most widely used technique for computer vision problems in satellite imagery and the GIS field [1]–[3]. These models use a multi-layer neural network architecture to learn features at different levels of abstraction [4]. Deep neural networks stack linear and nonlinear operations in layers to extract features from input data. Convolutional neural networks (CNNs) are frequently employed in computer vision, particularly for tasks like object detection. The region-based CNN (R-CNN) model presented in [5] generates region proposals from the image using a search algorithm, feeds these regions into a CNN for feature extraction, and uses a support vector machine (SVM) to classify the bounding boxes [6]. Fast R-CNN [7] uses a CNN to extract a feature map and crops the region proposals from that feature map to generate regions of interest (RoIs). It then uses fully connected layers and softmax for bounding-box localization and classification [8]. You only look once (YOLO) [9] uses another technique that splits the image into grids with