2023 International Conference on Network, Multimedia and Information Technology (NMITCON)
979-8-3503-0082-6/23/$31.00 ©2023 IEEE
Enhancing Object Detection with Mask R-CNN: A
Deep Learning Perspective
1st Kamepalli Sujatha
Department of IT and CA
Vignan’s Foundation for Science Technology & Research
Guntur, India
sujatha_kamepalli@yahoo.com
2nd Kommineni Amrutha
Department of IT and CA
Vignan’s Foundation for Science Technology & Research
Guntur, India
kommineniamrutha@gmail.com
3rd N. Veeranjaneyulu
Department of IT and CA
Vignan’s Foundation for Science Technology & Research
Guntur, India
veeru2006n@gmail.com
Abstract— Object detection is a critical task in machine learning, focused on locating and recognizing specific elements of interest within an image. This study presents the authors' method for object detection, in which a Mask R-CNN model with a ResNet-50-FPN backbone, pre-trained on the COCO dataset, is used as a feature extractor. Relevant features are first extracted from the input images with the pre-trained model and then used to train several neural network architectures for object recognition, including CNN, VGG-16, and Inception Net. The proposed method improves accuracy in object detection and efficiently extracts relevant features using pre-trained models. The three frameworks were assessed on two datasets, Pascal VOC and COCO, on which the VGG-16 model achieved high accuracy: 89.9% and 95.4%, respectively. The experimental results show that the suggested approach is beneficial, with the model achieving higher accuracy in object detection than existing methods. This research contributes to the development of efficient and adaptable frameworks for recognizing and separating objects in images, which is essential for the automation of machine vision systems.
Keywords—Machine vision, Object detection, Segmentation,
Mask R-CNN, Deep learning, Region proposal network.
I. INTRODUCTION
Machine vision is an essential field of study that has revolutionized various industries and technologies, including autonomous driving, industrial automation, and medical applications. It relies on computer algorithms to analyze and interpret digital images or videos, enabling machines to perceive their surroundings and act accordingly. Object detection is a fundamental component of machine vision, where the task revolves around the identification and precise localization of objects within an image or video stream (Meimetis et al., 2023).
Identifying objects can be difficult due to variation in their appearance, shape, size, and orientation, as well as the presence of occlusions and clutter in real-world scenes. In recent years, machine learning has become a potent technique for object recognition, with various cutting-edge architectures developed for this task. (He et al., 2017) introduced the Mask R-CNN architecture in 2017. By adding a branch for object mask prediction alongside the bounding-box and class predictions, Mask R-CNN builds on the success of the Faster R-CNN architecture, which combines two distinct models: a region proposal network (RPN) for generating object proposals and a fully convolutional network for predicting object classes and refining object boundaries. The result is an accurate and efficient object detection and segmentation framework that can be easily adapted to various applications (Dhruva et al., 2021) (Ren et al., 2017).
The task of instance segmentation involves accurately identifying and segmenting all the objects in an image, including their individual instances. It combines elements of semantic segmentation and object detection in order to locate and classify each object with a bounding box. Traditional semantic segmentation assigns each pixel to a pre-defined category without distinguishing between separate instances of the same class, which can lead to suboptimal outcomes. This paper demonstrates that the most advanced existing segmentation techniques can be outperformed by a straightforward, adaptable, and effective framework (Pont-Tuset et al., 2017).
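The distinction between semantic and instance segmentation described above can be illustrated with a minimal sketch (the 4x4 scene and class labels here are invented for illustration): a semantic map assigns each pixel a class id only, while instance segmentation yields one binary mask per object, keeping two objects of the same class separate.

```python
import numpy as np

# Semantic segmentation: every pixel gets a class label (1 = "object"),
# so the two separate objects below are indistinguishable in the output.
semantic = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

# Instance segmentation: one binary mask per object, so the two
# instances of the same class remain separate.
instance_masks = [
    np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=bool),
    np.array([[0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=bool),
]

# The union of the instance masks reproduces the semantic map for this
# class, but only the instance masks tell the two objects apart.
union = np.zeros((4, 4), dtype=bool)
for mask in instance_masks:
    union |= mask
assert (union == (semantic == 1)).all()
```

Mask R-CNN produces exactly the per-instance form: a separate binary mask for each detected object, alongside its class label and bounding box.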
A. Mask R-CNN Model
Mask R-CNN (Mask Region-based Convolutional Neural Network) is a deep neural network used for object recognition and segmentation in computer vision. It is a development of the Faster R-CNN (Region-based Convolutional Neural Network) approach, which employs a region proposal network (RPN) and a detection layer in tandem to propose candidate objects and categorize them into predetermined classes. Faster R-CNN serves as the foundation for Mask R-CNN, which extends it with a third branch for object mask prediction in addition to the existing branches for object identification and classification (Sujatha & Srinivasa Rao, 2019). This additional branch predicts a segmentation mask for each detected object, which gives more accurate information about the object's spatial extent than a bounding box alone. Mask R-CNN uses a convolutional neural network (CNN) backbone, such as ResNet or VGG, to extract features from the image; these features are then used by the RPN and the detection network to generate object proposals and classify them. The mask branch uses the feature maps produced by the backbone to predict a binary mask for every proposed object (Ren et al., 2017).
B. Architecture of Mask R-CNN Model
Mask R-CNN's structure comprises two stages. The first is the Region Proposal Network (RPN), which produces region proposals; the second is the Mask Head, which refines the object proposals and generates object masks. The RPN is a fully convolutional network that takes an image as input and outputs a collection of object proposals along with their objectness scores. These proposals are then refined using a
DOI: 10.1109/NMITCON58196.2023.10276033