An Evaluation of YOLO-Based Algorithms for Hand Detection in the Kitchen

Joshua van Staden 1 and Dane Brown 2
Computer Science, Rhodes University, Grahamstown, South Africa
1 g14v2805@campus.ru.ac.za, 2 d.brown@ru.ac.za

Abstract—Convolutional Neural Networks (CNNs) offer an accurate method for running object detection on images. In particular, the YOLO family of object detection algorithms has proven to be relatively fast and accurate. Since its inception, the variants of this algorithm have been tested on different datasets. In this paper, we evaluate the performance of these algorithms on the recent Epic Kitchens-100 dataset. This dataset provides egocentric footage of people interacting with various objects in the kitchen; most prominent in this footage is an egocentric view of the participants' hands. We aim to use the YOLOv3 algorithm to detect these hands within the footage provided in this dataset. In particular, we examine the YOLOv3 algorithm using two different backbones: MobileNet-lite and VGG16. We trained them on a mixture of samples from the Egohands and Epic Kitchens-100 datasets. In a separate experiment, average precision was measured on an unseen Epic Kitchens-100 subset. We found that the models are relatively simple and achieve lower scores on the Epic Kitchens-100 dataset, which we attribute to the high background noise in this dataset. Nonetheless, the VGG16 backbone achieved a higher Average Precision (AP) and is, therefore, better suited for retrospective analysis. None of the models was suitable for real-time analysis due to the complexity of the egocentric data.

Index Terms—Computer Vision, Cooking, Deep Learning, Hand Detection, Object Detection

I. INTRODUCTION

Interest in the Convolutional Neural Network (CNN) began escalating when the AlexNet architecture [1] won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competition in 2012.
Since this breakthrough, CNNs have been adapted to run object detection on images. One particularly useful algorithm for object detection has been the You Only Look Once (YOLO) algorithm [2], which enables CNNs to detect objects at rates exceeding 30 Hz [2]. In the literature concerning real-time studies, such as [3], [4] and [5], a camera with a typical frame rate outputs footage at 30 frames per second (FPS). Therefore, we define 30 Hz as real-time performance, which corresponds to an inference time of roughly 0.033 s or less per scene.

A good measure of an algorithm's ability to generalize lies in how it adapts to various datasets. This paper aims to evaluate a set of object detection models on the relatively recent Epic Kitchens-100 dataset [6]. This dataset offers an egocentric set of videos of cooking-related activities, applicable to health and lifestyle.

This study was funded by the National Research Foundation (120654). This work was undertaken in the Distributed Multimedia CoE at Rhodes University.

In this paper, we aim to evaluate the performance of different variants of the YOLOv3 algorithm on the Epic Kitchens-100 dataset. Specifically, we vary the backbone used in the YOLOv3 algorithm. From this, we obtain a benchmark of these existing algorithms on a new dataset. We examine the Average Precision (AP), FPS and Precision-Recall curves achieved by these algorithms.

Section II reviews object detection systems found in the literature. The test metrics and experimental results are discussed and analyzed in Section III. The paper is summarized and concluded in Section IV.

II. OBJECT DETECTION

Object detection aims to find various objects within an image and classify them [7]. It is broken down into two components: localization and classification. Firstly, localization determines the location of objects in an image and produces a bounding box or binary mask around each object.
Secondly, classification involves determining the class of each localized object. CNNs are particularly useful for classification, but can also localize objects [7]. Two approaches to CNN-based object detection are two-stage algorithms, such as Faster R-CNN, and one-stage algorithms, such as YOLO. This section describes CNNs and details their development from the 2012 AlexNet to modern-day networks.

A. Convolutional Neural Networks

CNNs [8] consist of two parts – convolutional and fully-connected. The raw image is fed to the convolutional layers, which extract various features of the image. Depending on the network architecture, each convolutional layer is followed by various other layers that further refine its output; typically, this is a max-pooling layer. The features extracted from the final convolutional layer are flattened and fed into a fully connected layer. These layers are described individually in this subsection.

1) Convolutional: The raw image is fed into the first convolutional layer, which consists of a sliding filter [1]. The output of a convolutional layer is a feature map: each point in the feature map corresponds to a dot product of the sliding filter with the input patch it covers. Semantically, a convolution aims to filter small-scale structures within an image. The selected patterns differ depending on the architecture of the network and the
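The sliding-filter operation described above can be sketched in plain NumPy. This is an illustrative toy only: the filter weights and image below are hypothetical (a classic vertical-edge kernel, which a CNN would instead learn during training), and real networks use optimized, batched GPU kernels rather than Python loops.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding
    ("valid" mode); each output cell is the dot product of the
    kernel with the image patch it covers, yielding a feature map."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 4x4 image with a vertical edge between its left and right halves.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# Hypothetical vertical-edge filter (Sobel-like weights).
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d(image, kernel)  # shape (2, 2); responds strongly at the edge
```

Every cell of this small feature map responds to the edge because every 3x3 patch of the toy image straddles it; on a larger image, high responses would localize the edge. (As in most deep-learning frameworks, the operation implemented here is technically cross-correlation, conventionally called "convolution" in the CNN literature.)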