Semantic Comparison of State-of-the-Art Deep Learning Methods for Image Multi-Label Classification

Adam Kubany, Shimon Ben Ishay, Ruben-sacha Ohayon, Armin Shmilovici, Lior Rokach, Tomer Doitshman
Department of Software and Information System Engineering, Ben-Gurion University of the Negev, Israel

ABSTRACT

Image understanding relies heavily on accurate multi-label classification. In recent years, deep learning (DL) algorithms have become very successful tools for multi-label classification of image objects. With this set of tools, various implementations of DL algorithms have been released for public use in the form of application programming interfaces (APIs). In this study, we evaluate and compare 10 of the most prominent publicly available APIs in a best-of-breed challenge. The evaluation is performed on the Visual Genome labeling benchmark dataset using 12 well-recognized similarity metrics. In addition, for the first time in this kind of comparison, we use a semantic similarity metric to evaluate the semantic similarity performance of these APIs. In this evaluation, Microsoft's Computer Vision, TensorFlow, Imagga, and IBM's Visual Recognition performed better than the other APIs. Furthermore, the new semantic similarity metric allowed deeper insights into the comparison.

Keywords: multi-label classification comparison, deep learning, image understanding, semantic similarity

I. INTRODUCTION

Accurate semantic identification of objects, concepts, and labels from images is one of the preliminary challenges in the quest for image understanding. It is only natural that machine learning and natural language researchers have been highly motivated to address these challenges. The race to achieve good label classification has been fierce and has become even more so as a result of public competitions such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
The obvious next step in this quest lies in expanding the challenge from single-label to multi-label classification. With this challenge in mind, different learning approaches for multi-label classification have been suggested. Tsoumakas and Katakis [1,2] divided these approaches into two main categories: 1) problem transformation methods, which transform the problem into one or more single-label classification problems and then transform the results back into a multi-label representation; and 2) algorithm adaptation methods, which try to solve the multi-label prediction problem as a whole, directly from the data. In 2012, Madjarov et al. [3] introduced a third category, referred to as ensemble methods; this category consists of methods that combine classifiers to solve the multi-label classification problem. In this approach, each of the base classifiers in the ensemble can belong to either the problem transformation or the algorithm adaptation category. As the research field of multi-label classification has advanced, more effective approaches have been developed [3,4]. In recent years, deep learning methods, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their variants, have demonstrated excellent performance in visual and multi-label classification [5-14]. Some of the more successful methods have been published as APIs for public use. The more salient approaches were published by research groups from Imagga,3 IBM Watson,4 Clarifai,5 Microsoft,6 Wolfram Alpha,7 Google,8 Caffe,9 DeepDetect,10 OverFeat,11 and TensorFlow.12 With these recent publications, the need for a best-of-breed performance comparison has arisen.

Footnotes:
1 ILSVRC: ImageNet Large Scale Visual Recognition Challenge
2 http://image-net.org/challenges/LSVRC/
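To make the problem transformation category concrete, the simplest such transformation, binary relevance, trains one binary classifier per label and merges the per-label predictions back into a label set. The sketch below is illustrative only: the toy learner, feature sets, and label names are made up and stand in for any real binary classifier.

```python
# Sketch of the "problem transformation" family via binary relevance:
# a multi-label task over a label set L is decomposed into |L| independent
# binary (single-label) tasks, whose predictions are merged into a label set.
# The toy learner below is hypothetical, not a real classification algorithm.

def binary_relevance_fit(X, Y, labels, fit_binary):
    """Train one binary classifier per label.

    X: list of samples; Y: list of label sets, one per sample;
    fit_binary(X, y) must return a predict(x) -> 0/1 function.
    """
    models = {}
    for label in labels:
        # Single-label view of the data: is `label` present in each sample?
        y = [1 if label in label_set else 0 for label_set in Y]
        models[label] = fit_binary(X, y)
    return models

def binary_relevance_predict(models, x):
    """Merge the per-label binary predictions back into a multi-label set."""
    return {label for label, predict in models.items() if predict(x) == 1}

# Toy binary learner: predicts 1 iff the sample shares a feature with any
# positive training sample (illustration only, not a usable classifier).
def toy_fit(X, y):
    positive_features = set()
    for features, is_positive in zip(X, y):
        if is_positive:
            positive_features.update(features)
    return lambda x: 1 if positive_features & set(x) else 0

if __name__ == "__main__":
    X = [{"fur", "tail"}, {"wings", "beak"}, {"fur", "whiskers"}]
    Y = [{"dog", "animal"}, {"bird", "animal"}, {"cat", "animal"}]
    models = binary_relevance_fit(X, Y, {"dog", "cat", "bird", "animal"}, toy_fit)
    print(binary_relevance_predict(models, {"fur"}))  # dog, cat, and animal match
```

An algorithm adaptation method would instead predict the whole label set in one model (as the CNN/RNN approaches compared in this study do), avoiding the per-label decomposition.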
While some comparisons between multi-label classification methods have been performed in the past [3,4], none of them included the latest deep learning approaches. In this study, we address this need and evaluate the performance of 10 state-of-the-art deep learning approaches. A benchmark comparison is best accomplished with a state-of-the-art dataset; for that purpose, we chose the Visual Genome dataset [15], which includes rich metadata and semantic annotations for multi-domain everyday images. We evaluate and compare these 10 approaches with well-established multi-label evaluation metrics.

Footnotes:
3 https://imagga.com/solutions/auto-tagging.html
4 http://www.ibm.com/watson/developercloud/visual-recognition.html
5 https://www.clarifai.com/
6 https://www.microsoft.com/cognitive-services/en-us/computer-vision-api
7 https://www.imageidentify.com/
8 https://cloudplatform.googleblog.com/2015/12/Google-Cloud-Vision-API-changes-the-way-applications-understand-images.html
9 http://caffe.berkeleyvision.org/
10 http://www.deepdetect.com/
11 http://cilvr.nyu.edu/doku.php?id=software:overfeat:start
12 https://www.tensorflow.org/versions/r0.9/tutorials/image_recognition/index.html
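To illustrate the two metric families involved, the sketch below (not the study's actual implementation, and not its 12 metrics) shows example-based precision/recall/F1, which score strict set overlap between predicted and ground-truth labels, alongside a cosine-based semantic score that can credit near-miss labels. The label sets and 2-d embedding vectors are made up for illustration.

```python
import math

# Example-based multi-label metrics: each image's predicted label set is
# compared against its ground-truth set, and scores are averaged over images.

def example_precision(truth, pred):
    return len(truth & pred) / len(pred) if pred else 0.0

def example_recall(truth, pred):
    return len(truth & pred) / len(truth) if truth else 0.0

def example_f1(truth, pred):
    p, r = example_precision(truth, pred), example_recall(truth, pred)
    return 2 * p * r / (p + r) if p + r else 0.0

def mean_metric(metric, ground_truth, predictions):
    """Average an example-based metric over the whole dataset."""
    scores = [metric(t, p) for t, p in zip(ground_truth, predictions)]
    return sum(scores) / len(scores)

# A semantic similarity score can credit near-misses ("puppy" vs. "dog")
# that strict set overlap counts as errors, e.g. via cosine similarity of
# word embeddings; the 2-d vectors below are made up.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

if __name__ == "__main__":
    truth = [{"dog", "grass", "ball"}, {"car", "road"}]
    preds = [{"dog", "ball", "frisbee"}, {"car", "tree"}]
    print(mean_metric(example_f1, truth, preds))  # strict set-overlap score
    embedding = {"puppy": [0.9, 0.1], "dog": [0.8, 0.2]}
    print(cosine(embedding["puppy"], embedding["dog"]))  # close to 1.0
```

The distinction matters for API comparison: an API predicting "puppy" where the ground truth says "dog" scores zero under set-overlap metrics but near-perfect under a semantic metric.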