Copyright © Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Comparison of CSPDarkNet53, CSPResNeXt-50, and EfficientNet-B0 Backbones on YOLO V4 as Object Detector
Marsa Mahasin, Irma Amelia Dewi
Department of Informatics, National Institute of Technology, Bandung, Indonesia
*Corresponding author E-mail: mahasingrad@mhs.itenas.ac.id
Manuscript received 15 April 2022; revised 1 May 2022; accepted 15 June 2022. Date of publication 25 July 2022
Abstract
YOLO v4 has a structure consisting of three parts: backbone, neck, and head. The backbone serves as the feature extractor for the input image; it is itself a convolutional neural network and can be replaced with another convolutional neural network. Previous research recommends several backbones, such as CSPDarkNet53, CSPResNeXt-50, and EfficientNet-B0. Research is therefore needed to determine the effect of different backbones on the YOLO v4 model. One suitable research object is the microfossil. Research on microfossil detection is fundamental to helping paleontologists identify microfossil species as a determinant of rock age and distinguish between similar microfossils. In this research, three backbones (CSPDarkNet53, CSPResNeXt-50, and EfficientNet-B0) were used to train on and detect image sets of five species of foraminiferal microfossils, and the results were evaluated to determine the advantages of each backbone. Several metrics were used for evaluation, namely precision, recall, F1-score, average precision (AP), mean average precision (mAP), frames per second (FPS), and model size.
As a result, the mean average precision (mAP) of the CSPDarkNet53 model reached 83.41%, the highest of the three; CSPResNeXt-50 and EfficientNet-B0 reached 81.00% and 81.76%, respectively. The CSPResNeXt-50 model has a precision of 75.60%, a recall of 81.10%, and an F1-score of 78%. The CSPDarkNet53 model also achieved the highest speed, 33.4 FPS. However, the YOLO v4 model with the EfficientNet-B0 backbone is the lightest model, at only 156.8 MB.
Keywords: YOLO, CSPDarkNet53, CSPResNeXt-50, EfficientNet-B0, Microfossil
1. Introduction
You Only Look Once (YOLO) is an algorithm based on a convolutional neural network that is often used for object detection and object classification. Its structure consists of several parts (backbone, neck, and head), each of which has a different function. The YOLO algorithm has advantages over other one-stage detectors, such as RetinaNet and the Single Shot Multibox Detector (SSD), when used in real-time conditions: it produces higher frames per second (FPS) and a lighter model, so that its detection capability is better [1]. The backbone of the YOLO algorithm acts as a feature extractor for the input image. Backbones that can be used in the YOLO algorithm include CSPDarkNet53, CSPResNeXt-50, and EfficientNet-B0, which reach up to 70% accuracy while still detecting objects at up to 40 FPS [2].
Microfossils are closely related to biostratigraphy, the science of determining the age of rocks from the fossils they contain. The complex morphology of microfossils requires specialists for correct systematics, especially to produce detailed and accurate biostratigraphic correlations [3]. Education and training in identifying microfossils are dwindling, but technological developments make it possible to accelerate and standardize the characterization and identification of fossils by machine learning [4].
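As a brief aside on the evaluation metrics named in the abstract: the F1-score is the harmonic mean of precision and recall. The minimal sketch below recomputes it from the precision (75.60%) and recall (81.10%) reported for the CSPResNeXt-50 model. The function name `f1_score` is illustrative and is not taken from the paper or from any particular library.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        # Convention assumed here: F1 is 0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the CSPResNeXt-50 model.
reported_precision = 0.7560
reported_recall = 0.8110

f1 = f1_score(reported_precision, reported_recall)
print(f"F1-score: {f1:.2%}")
```

Rounded to a whole percentage, the recomputed value agrees with the 78% stated in the abstract.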
The latest research on microfossil images classified them using a convolutional neural network (CNN) with seven different models, namely VGG-19, Inception-ResNetV2, MobileNetV2, ResNet50, Xception, NASNetMobile, and DenseNet121. That study concluded that the ResNet50 model had the greatest accuracy, 81.8%, with 76.7% precision and 71.4% recall [5].
In the YOLO v4 study entitled "YOLOv4: Optimal Speed and Accuracy of Object Detection", RetinaNet, EfficientDet-D0, RFBNet, NAS-FPN, ATSS, RDSNet, CenterMask, LRF, Faster R-CNN, M2det, SSD, and TridentNet were compared with YOLO v4. RetinaNet and EfficientDet-D0 achieved FPS and AP values closer to those of YOLO v4 than the other detectors did. YOLO v4 with the CSPDarkNet53 backbone scored 96 FPS and 41.2% AP, EfficientDet-D0 scored 62.5 FPS and 33.8% AP, and RetinaNet scored 37 FPS and 37% AP. The CSPResNeXt-50 backbone is used in RetinaNet, and EfficientNet-B0 is used in EfficientDet-D0 [2].
International Journal of Engineering, Science & Information Technology (IJESTY), Volume 2, No. 3 (2022), pp. 64-72, ISSN 2775-2674 (online), Website: http://ijesty.org/index.php/ijesty DOI: https://doi.org/10.52088/ijesty.v1i4.291
Based on previous research, further research is needed to validate the hypotheses of previous studies and to determine the effect of using different backbones on YOLO v4. The effect of the backbone can be evaluated by mean average precision (mAP), average