Fast and Accurate Convolutional Object Detectors for Real-time Embedded Platforms

Min-Kook Choi (hutom) mkchoi@hutom.io
Jaehyung Park (DGIST) stillrunning@dgist.ac.kr
Heechul Jung (KNU) heechul@knu.ac.kr
Jinhee Lee (DGIST) jhlee07@dgist.ac.kr
Soo-Heang Eo (hutom) sooheang@hutom.io

Abstract

With the improvements in object detection networks, several variants of object detection networks have achieved impressive performance. However, the performance evaluation of most models has focused on detection accuracy, and performance verification is mostly carried out on high-end GPU hardware. In this paper, we propose real-time object detectors that guarantee balanced performance for real-time systems on embedded platforms. The proposed models use the basic head structure of the RefineDet model, which is a variant of the single shot object detector (SSD). To ensure real-time performance, CNN models with relatively shallow layers or fewer parameters are used as the backbone structure. In addition to the basic VGGNet and ResNet structures, various backbone structures such as MobileNet, Xception, ResNeXt, Inception-SENet, and SE-ResNeXt are used for this purpose. Successful training of the object detection networks was achieved through an appropriate combination of intermediate layers. The accuracy of the proposed detectors was estimated on the MS-COCO 2017 object detection dataset, and the inference speed was tested on the NVIDIA Drive PX2 and Jetson Xavier boards to verify real-time performance on embedded systems. The experiments show that the proposed models ensure balanced performance in terms of accuracy and inference speed in embedded system environments. In addition, unlike high-end GPUs, the use of embedded GPUs involves several additional concerns for efficient inference, which we identify in this work. The code and models are publicly available on the web (link).
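The abstract describes attaching RefineDet-style detection heads to intermediate feature maps of interchangeable backbones. The sketch below illustrates the arithmetic behind that multi-scale head layout: each head sits on a feature map whose spatial size is the input size divided by the backbone stride at that stage, and the total number of default boxes is the sum over all heads. The stride set {8, 16, 32, 64} and 3 anchors per cell match the common RefineDet320 configuration; the function names are our own illustrative choices, not the paper's code.

```python
# Illustrative sketch (not the paper's implementation) of the multi-scale
# head layout used by SSD/RefineDet-style one-stage detectors.

def feature_map_sizes(input_size, strides):
    """Spatial size of each feature map a detection head attaches to
    (ceiling division of input size by the backbone stride)."""
    return [(input_size + s - 1) // s for s in strides]

def total_anchors(input_size, strides, anchors_per_cell=3):
    """Total number of default boxes the one-stage head predicts over,
    summed across all attached feature maps."""
    return sum(f * f * anchors_per_cell
               for f in feature_map_sizes(input_size, strides))

# A 320x320 input with heads on the stride-8/16/32/64 feature maps:
print(feature_map_sizes(320, [8, 16, 32, 64]))   # [40, 20, 10, 5]
print(total_anchors(320, [8, 16, 32, 64]))       # 3 * (1600+400+100+25) = 6375
```

This arithmetic is backbone-agnostic, which is what makes it possible to swap VGGNet, ResNet, MobileNet, or the other listed backbones under the same head structure: only the choice of which intermediate layers supply the stride-8/16/32/64 maps changes.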
[Figure 1 plot: compares Fast R-CNN (VGG-16), CoupleNet (ResNet-101), SSD300*/SSD512* (VGG-16), YOLOv2, RefineDet320/512 (VGG-16), and the proposed rRefineDet320/512 variants with VGG-16, ResNet-18, ResNeXt-26/50, Inception-SENet, MobileNetV2, and SE-ResNeXt50 backbones; axes are frames per second (fps) and mean AP (mAP).]

Figure 1. Speed (fps) versus accuracy (mAP) on MS-COCO test-dev14 or 17. Our models (red) have a balanced speed and accuracy compared to the existing real-time-oriented models (purple). Performance of the proposed models was measured on the NVIDIA Titan XP. Details of the performance measurements are described in Section 4.

arXiv:1909.10798v1 [cs.CV] 24 Sep 2019

1. Introduction

In recent years, the performance of object detection has dramatically improved due to the emergence of object detection networks that utilize the structure of CNNs [20]. Owing to this improvement, CNN-based object detectors have potential practical applications such as video surveillance [15], autonomous navigation [29], machine vision [31], and medical imaging [14]. Several industries have been making efforts to implement these technological advancements in conjunction with industrial applications.

Object detection methods using the CNN structure are broadly classified into two types. The first is the one-stage approach, which localizes and classifies objects in an image within a single network stream, dividing the encoded features along different depth dimensions into localization and classification outputs. Representative models of this one-stage network are YOLO