Copyright © Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original work is properly cited.
Comparison of CSPDarkNet53, CSPResNeXt-50, and
EfficientNet-B0 Backbones on YOLO V4 as Object Detector
Marsa Mahasin, Irma Amelia Dewi
Department of Informatics, National Institute of Technology, Bandung, Indonesia
*Corresponding author E-mail: mahasingrad@mhs.itenas.ac.id
Manuscript received 15 April 2022; revised 1 May 2022; accepted 15 June 2022. Date of publication 25 July 2022
Abstract
YOLO v4 has a structure consisting of 3 parts: backbone, neck, and head. The backbone is a part of the YOLO v4 structure that serves as
a feature extractor from the image; the backbone is also a convolutional neural network that can be replaced with another convolutional
neural network. Many backbones are recommended by previous research, such as CSPDarkNet53, CSPResNeXt-50, and EfficientNet-B0.
Therefore, research needs to be done to determine the effect of different backbones on the YOLO v4 model. One of the research objects
that can be used is a microfossil. Research on the detection of microfossils is fundamental to assist paleontologists in knowing the species
of microfossils as a determinant of rock age and distinguishing between similar microfossils. In this research three backbones consisting
of CSPDarkNet53, CSPResNeXt-50, and EfficientNet-B0 were used to train and detect image sets of 5 species of foraminiferal microfossils
and the results were evaluated to determine the advantages of each backbone. Several metrics are used for evaluation,
namely precision, recall, f1-score, average precision (AP), mean average precision (mAP), frames per second (FPS), and model size. As a
result, the mean average precision (mAP) of the CSPDarkNet53 model reached 83.41%, the highest compared to CSPResNeXt-50 and
EfficientNet-B0, which obtained values of 81.00% and 81.76%. The CSPResNeXt-50 model has a precision of 75.60%, a recall of 81.10%, and an
f1-score of 78%. The CSPDarkNet53 model also achieved the highest FPS value of 33.4 FPS. However, the YOLO v4 model with the EfficientNet-B0
backbone is the lightest model at only 156.8 MB.
Keywords: YOLO, CSPDarkNet53, CSPResNeXt-50, EfficientNet-B0, Microfossil
1. Introduction
You Only Look Once (YOLO) is an algorithm based on a convolutional neural network that is often used for object detection and object
classification. The structure of this algorithm consists of several parts, namely the backbone, neck, and head, each of which has a different
function. The YOLO algorithm has advantages over other one-stage detectors such as RetinaNet and Single Shot Multibox Detector (SSD)
when used in real-time conditions, where it produces higher frames per second (FPS) and a lighter model size, so that its
detection capability is better [1]. The backbone of the YOLO algorithm acts as a feature extractor from the input image. Backbones that
can be used in the YOLO algorithm include CSPDarkNet53, CSPResNeXt-50, and EfficientNet-B0, which can
produce up to 70% accuracy while still detecting objects at up to 40 FPS [2].
Microfossil objects are closely related to biostratigraphy, namely the science of determining the age of rocks using the fossils contained
therein. The complex morphology of microfossils requires the use of specialists for correct systematics, especially to produce detailed and
accurate biostratigraphic correlations [3]. Education and training in identifying microfossils are dwindling, but technological developments
allow the possibility of accelerating and standardizing the characterization and identification of fossils by machine learning [4]. The latest
research classified microfossil images using Convolutional Neural Networks (CNN) with 7 different models, namely VGG-19,
Inception-ResNetV2, MobileNetV2, ResNet50, Xception, NASNetMobile, and DenseNet121. That study concluded that the CNN with the
ResNet50 model had the highest accuracy of 81.8%, with 76.7% precision and 71.4% recall [5].
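Precision and recall figures such as these are commonly summarized by the f1-score, the harmonic mean of the two. The following is a minimal sketch, not the evaluation code used in this research; the function name f1_score is a hypothetical helper introduced only for illustration. It checks the f1-score reported in the abstract for the CSPResNeXt-50 model against its reported precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        # Avoid division by zero when the model has no correct detections.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision (75.60%) and recall (81.10%) quoted in the abstract for CSPResNeXt-50.
cspresnext50_f1 = f1_score(0.7560, 0.8110)
print(f"{cspresnext50_f1:.2%}")  # prints 78.25%
```

The computed value of about 78.25% is consistent with the f1-score of 78% stated in the abstract.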
The backbone of the YOLO algorithm acts as a feature extractor from the input image. In the research on YOLO v4 entitled "YOLOv4:
Optimal Speed and Accuracy of Object Detection", RetinaNet, EfficientDet-D0, RFBNet, NAS-FPN, ATSS, RDSNet, CenterMask, LRF,
Faster R-CNN, M2det, SSD, and TridentNet were compared with YOLO v4. As a result, RetinaNet and EfficientDet-D0 achieved FPS and AP values
closer to YOLO v4 than the other detectors. YOLO v4 with the CSPDarkNet53 backbone scored 96 FPS and 41.2% AP, EfficientDet-D0 scored
62.5 FPS and 33.8% AP, and RetinaNet scored 37 FPS and 37% AP. The CSPResNeXt-50 backbone is used on RetinaNet and
EfficientNet-B0 is used on EfficientDet-D0 [2].
Based on previous research, further research is needed to validate the hypotheses from previous studies and determine the effect of
using different backbones on YOLO v4. The effect of the backbone can be evaluated by mean average precision (mAP), average
International Journal of Engineering, Science & Information Technology (IJESTY)
Volume 2, No. 3 (2022) pp. 64-72
ISSN 2775-2674 (online)
Website: http://ijesty.org/index.php/ijesty
DOI: https://doi.org/10.52088/ijesty.v1i4.291
Research Paper, Short Communication, Review, Technical Paper