(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 5, 2020

Image Captioning using Deep Learning: A Systematic Literature Review

Murk Chohan 1, Adil Khan 2, Muhammad Saleem Mahar 3, Saif Hassan 4, Abdul Ghafoor 5, Mehmood Khan 6
Department of Computer Science, Sukkur IBA University, Pakistan

Abstract—Automatic image captioning is the process of generating captions, or textual descriptions, for images based on their contents. It is a machine learning task that combines natural language processing (for text generation) with computer vision (for understanding image contents). Automatic image captioning is a recent and rapidly growing research problem, and new methods are continually being introduced to achieve satisfactory results in this field. However, considerable work is still required before results are as good as a human's. This study aims to establish, in a systematic way, which recent methods and models are used for image captioning with deep learning, how those models are implemented, and which methods are most likely to give good results. To that end, we performed a systematic literature review of recent studies from 2017 to 2019 drawn from well-known databases (Scopus, Web of Science, IEEE Xplore). We identified a total of 61 primary studies relevant to the objective of this research. We found that a CNN is used to understand image contents and detect objects in an image, while an RNN or LSTM is used for language generation. The most commonly used datasets are MS COCO, used in all studies, along with Flickr8k and Flickr30k. The most commonly used evaluation metric is BLEU (1 to 4), used in all studies. We also found that LSTM with CNN outperforms RNN with CNN. The two most promising approaches to implementing these models are the encoder-decoder architecture and the attention mechanism, and a combination of the two can improve results considerably.
This research provides guidelines and recommendations to researchers who want to contribute to automatic image captioning.

Keywords—Image Captioning; Deep Learning; Neural Network; Recurrent Neural Network (RNN); Convolutional Neural Network (CNN); Long Short-Term Memory (LSTM)

I. INTRODUCTION

Automatic image captioning is the process of automatically generating human-like descriptions of images. It is a prominent task with significant practical and industrial value [62], with applications in security, surveillance, medicine, agriculture, and many other important domains. It is not only a crucial but also a very challenging task in computer vision [1]. Traditional object detection and image classification tasks only need to identify objects within the image, whereas automatic image captioning must not only identify the objects but also determine the relationships between them and understand the scene as a whole. After understanding the scene, a human-like description of the image must then be generated. Since the rise of automation and Artificial Intelligence, a great deal of research has aimed at giving machines human-like capabilities and reducing manual work; for machines, achieving results and accuracy as good as a human's on the image captioning problem has always been very challenging. Automatic image captioning is performed through the following key tasks, in order. First, features are extracted; after proper feature extraction, the different objects in the image are detected; then the relationships between objects are identified (e.g., if the objects are a cat and grass, it must be determined whether the cat is on the grass). Once objects are detected and relationships identified, a text description must be generated, i.e., a sequence of words ordered so that they form a good sentence reflecting the relationships between the image objects.
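The pipeline of key tasks above can be sketched as a minimal encoder-decoder model. This is an illustrative assumption, not the architecture of any surveyed paper: the tiny CNN encoder stands in for a pretrained backbone (real systems typically reuse VGG or ResNet features), and the layer sizes, vocabulary size, and class name are invented for the sketch.

```python
# Minimal sketch of the CNN-encoder / LSTM-decoder captioning pipeline.
# All sizes and names are illustrative assumptions, not a surveyed model.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: a tiny CNN standing in for a pretrained backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global image feature: (B, 16, 1, 1)
            nn.Flatten(),
            nn.Linear(16, embed_dim),  # project into word-embedding space
        )
        # Decoder: embed previous words, run an LSTM, predict the next word.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feat = self.encoder(images).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)               # (B, T, E)
        # Feed the image feature as the first "token" of the sequence.
        inputs = torch.cat([feat, words], dim=1)   # (B, T+1, E)
        out, _ = self.lstm(inputs)
        return self.fc(out)                        # word logits per position

model = CaptionModel()
images = torch.randn(2, 3, 64, 64)         # dummy batch of RGB images
captions = torch.randint(0, 1000, (2, 5))  # dummy token ids
logits = model(images, captions)
print(logits.shape)  # torch.Size([2, 6, 1000])
```

At inference time the decoder would instead be run step by step, feeding each predicted word back in until an end-of-sentence token is produced.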
To perform the above key tasks with deep learning, different deep learning networks are used. For example, to obtain visual features and objects, a CNN with a region-proposal model such as R-CNN or Faster R-CNN can be used, and to generate the text description in sequence, an RNN or LSTM can be used. With these networks, various methods have been developed to perform automatic image captioning in many different domains. However, there is still room to make machines capable enough to generate descriptions like a human [61]. After training a deep learning network for image captioning, its performance can be assessed with various evaluation metrics such as BLEU, CIDEr, and ROUGE-L. The purpose of this Systematic Literature Review is to study the newest articles, from 2017 to 2019, to find the different methods used to achieve automatic image captioning in different domains, the different datasets used for the task, the practical domains in which the task is applied, and the techniques that outperform others, and finally to describe the technicalities behind the different networks, methods, and evaluation metrics. Our study will help new researchers who want to work in this domain to attain better accuracy. We especially focused on the collection of quality articles that have been published to date. We attempt to find out the different techniques presented in articles [1-60] and to identify the strengths and weaknesses of their methods. Finally, we attempt to summarize them to explain which technique performs best in its particular domain. Our work mostly focuses on identifying the most popular techniques. The areas in which yet there is
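To make the evaluation metrics mentioned above concrete, the following is a hand-rolled sketch of BLEU-1, i.e., clipped unigram precision with a brevity penalty. Real evaluations use the full BLEU-1 to BLEU-4 with smoothing (e.g., via NLTK or the COCO evaluation toolkit); this minimal version, with an invented example sentence pair, is only meant to show the idea.

```python
# Illustrative BLEU-1 sketch: clipped unigram precision * brevity penalty.
# The example sentences are invented; real scoring uses BLEU-1..4 with smoothing.
import math
from collections import Counter

def bleu1(candidate, reference):
    """Clipped unigram precision times brevity penalty (single reference)."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate word's count by its count in the reference,
    # so repeating a correct word does not inflate the score.
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(cand)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

reference = "a cat is sitting on the grass"
print(round(bleu1("a cat sitting on grass", reference), 3))      # 0.67
print(round(bleu1("a cat is sitting on the grass", reference), 3))  # 1.0
```

In the first call every candidate word appears in the reference (precision 1.0), but the 5-word candidate is shorter than the 7-word reference, so the brevity penalty exp(1 - 7/5) ≈ 0.670 lowers the score.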