2023 International Conference on Network, Multimedia and Information Technology (NMITCON) 979-8-3503-0082-6/23/$31.00 ©2023 IEEE

Enhancing Object Detection with Mask R-CNN: A Deep Learning Perspective

1st Kamepalli Sujatha, Department of IT and CA, Vignan's Foundation for Science Technology & Research, Guntur, India, sujatha_kamepalli@yahoo.com
2nd Kommineni Amrutha, Department of IT and CA, Vignan's Foundation for Science Technology & Research, Guntur, India, kommineniamrutha@gmail.com
3rd N. Veeranjaneyulu, Department of IT and CA, Vignan's Foundation for Science Technology & Research, Guntur, India, veeru2006n@gmail.com

Abstract— Object detection is a critical task in machine learning, focused on locating and recognizing specific elements of interest within an image. This study presents a method that uses a Mask R-CNN model with a ResNet-50-FPN backbone, pre-trained on the COCO dataset, as a feature extractor. The pre-trained model first extracts relevant features from the input images, which are then used to train several neural network architectures for object recognition, including CNN, VGG-16, and Inception Net. The proposed method improves object-detection accuracy and efficiently extracts relevant features using pre-trained models. The three frameworks were evaluated on two datasets, Pascal VOC and COCO, on which the VGG-16 model achieved the highest accuracy: 89.9% and 95.4% respectively. The experimental results show that the suggested approach is effective, with the model achieving higher object-detection accuracy than other existing methods. This research contributes to the development of efficient and adaptable frameworks for recognizing and segmenting objects in images, which are essential for the automation of machine vision systems.
Keywords—Machine vision, Object detection, Segmentation, Mask R-CNN, Deep learning, Region proposal network.

I. INTRODUCTION

Machine vision is an essential field of study that has revolutionized various industries and technologies, including autonomous driving, industrial automation, and medical applications. It relies on computer algorithms to analyze and interpret digital images or videos, enabling machines to perceive their surroundings and act accordingly. Object detection is a fundamental component of machine vision, where the task revolves around the identification and precise localization of objects within an image or video stream (Meimetis et al., 2023). Detection is difficult because objects vary in appearance, shape, size, and orientation, and because real-world scenes contain occlusions and clutter. In recent years, machine learning has become a powerful technique for object recognition, with various cutting-edge architectures developed for this task. He et al. (2017) announced the Mask R-CNN architecture in 2017; by including a branch for object mask prediction in addition to the bounding-box and class predictions, Mask R-CNN expands on the success of the Faster R-CNN architecture. The framework combines two distinct components: a region proposal network (RPN) for generating object proposals and a fully convolutional network for predicting object classes and refining object boundaries. The result is an accurate and efficient object detection and segmentation framework that can be easily adapted to various applications (Dhruva et al., 2021) (Ren et al., 2017). The task of instance segmentation involves accurately identifying and segmenting all the objects in an image, including their individual instances. This process combines elements from both semantic segmentation and object detection in order to locate and classify each object using a bounding box.
A traditional method for semantic segmentation assigns each pixel to a pre-defined category without distinguishing between separate instances, which can result in suboptimal outcomes. This paper demonstrates that existing state-of-the-art segmentation techniques can be outperformed by a straightforward, adaptable, and effective framework (Pont-Tuset et al., 2017).

A. Mask R-CNN Model

Mask R-CNN (Mask Region-based Convolutional Neural Network) is a deep neural network used for object recognition and segmentation in computer vision. It is a development of the Faster R-CNN (Region-based Convolutional Neural Network) approach, which employs a region proposal network (RPN) and a detection head in tandem to generate object proposals and classify them into predetermined categories. Faster R-CNN serves as the foundation for Mask R-CNN, which expands on it by adding a third branch for object mask prediction in addition to the existing branches for object localization and classification (Sujatha & Srinivasa Rao, 2019). This additional branch predicts a segmentation mask for each detected object, which gives more accurate information about the spatial extent of the object than a bounding box alone. Mask R-CNN uses a convolutional neural network (CNN) backbone, such as ResNet or VGG, to extract features from the image; the RPN and detection network then use these features to generate object proposals and classify them. The mask branch uses the feature maps produced by the backbone to predict a binary mask for every proposed object (Ren et al., 2017).

B. Architecture of Mask R-CNN Model

Mask R-CNN's architecture comprises two stages. The first is the Region Proposal Network (RPN), which produces region proposals; the second is the Mask Head, which refines the object proposals and generates object masks.
An image is fed into the RPN, which is a fully convolutional network, and a set of object proposals is produced, along with their objectness scores. These proposals are then refined in the second stage, which classifies each proposal, tightens its bounding box, and predicts its segmentation mask.

DOI: 10.1109/NMITCON58196.2023.10276033
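Refining the proposals hinges on overlap scoring: candidates are ranked by objectness and near-duplicates are suppressed. A minimal pure-Python sketch of intersection-over-union (IoU) and greedy non-maximum suppression (NMS), independent of any framework:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the near-duplicate second box is dropped
```

Production detectors use vectorized equivalents (e.g. `torchvision.ops.nms`), but the logic is the same.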