International Journal of Computer Science Trends and Technology (IJCST) – Volume 6, Issue 3, May–June 2018
ISSN: 2347-8578, www.ijcstjournal.org

Development of an Arabic Image Description System

Rasha Mualla [1], Jafar Alkheir [2]
Department of Computer and Control Engineering, Tishreen University, Latakia, Syria

ABSTRACT
Image description models are among the most important and trending topics in machine learning. Many recent studies have developed systems for image classification and description, mostly for English; the Arabic language has received little attention in this field. This research introduces two image description models: an English-based model and an Arabic-based model. In our study, CNN deep learning networks are used for image feature extraction. In the training stage, LSTM networks are chosen for their ability to memorize previous words in image description sentences. The LSTM is fed two inputs: the image features and the image description file. A new JSON image description file is built for the Arabic model, and the research uses a subset of the Flickr8k dataset consisting of 1500 training images, 250 validation images, and 250 test images. Performance is evaluated with BLEU-n and several other metrics to compare the Arabic and English models.

Keywords: Machine Learning, Deep Learning, Image Description, CNN, LSTM, Arabic Description, JSON.

I. INTRODUCTION
Due to the increasing number of layers required in traditional networks, some practical problems emerge. One of them is the additional training time incurred when using gradient descent training algorithms such as backpropagation. In such algorithms, the gradient shrinks significantly as the number of network layers (the depth) grows, and this slows down the training process [1],[2],[3],[4].
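The abstract mentions a new JSON description file for the Arabic model built over a Flickr8k subset. The paper does not give the file's schema, so the layout below is a minimal illustrative sketch in the style of the widely used Karpathy-split caption files; all field names, the filename, and the sample Arabic caption are assumptions, not the paper's actual data.

```python
import json

# Assumed (Karpathy-split-style) layout for an Arabic caption file.
# Field names and the sample entry are illustrative, not from the paper.
arabic_captions = {
    "images": [
        {
            "filename": "1000268201_693b08cb0e.jpg",   # hypothetical Flickr8k file
            "split": "train",                           # train / val / test per the 1500/250/250 subset
            "sentences": [
                {
                    "raw": "كلب يركض في العشب",
                    "tokens": ["كلب", "يركض", "في", "العشب"],
                }
            ],
        }
    ]
}

# ensure_ascii=False keeps the Arabic text readable instead of \u-escaped.
text = json.dumps(arabic_captions, ensure_ascii=False, indent=2)
loaded = json.loads(text)
print(loaded["images"][0]["split"])  # train
```

A file in this shape lets the training loop pair each image's extracted CNN features with its tokenized caption by filename and split.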
Another problem is over-fitting, in which the training process converges and terminates early, so the network responds well to the training samples but poorly to the test ones. This problem is due to the growing number of neurons in each layer [33],[40],[46],[47]. In addition, redundant neurons and layers result in extra weights and biases that must be learned, and the full connectivity between the neurons of one layer and those of the next increases the complexity of these networks.

Deep learning addresses these problems in several ways, such as reducing the interconnections between neurons from one layer to the next. Another way to reduce training time is to share fixed weights and biases across the network. Because the number of layers grows, deep neural networks use pooling layers that reduce the dimension of the data; this shortens training time and retains the most important information from the input [32],[33],[34],[40]. Another characteristic of deep learning is recurrence: the network memorizes some of the previous data, which benefits later layers. This memory makes the network useful for applications in which the output depends on earlier inputs (e.g., language generation and translation) [18].

Over the last ten years, many studies have used different types of deep neural networks for image description, such as convolutional neural networks (CNN) for image classification, recurrent neural networks (RNN) for image description, and long short-term memory (LSTM) networks for description systems with long memory dependencies.

The remainder of the paper is organized as follows: Section II introduces previous and related studies. Section III describes the materials and methods used in this study, including the proposed image description model. Section IV discusses the results. The paper ends with a conclusion.

II.
RELATED WORK
There are many studies in the field of image classification and description. While some of them use symbolic datasets [14],[19],[28],[37], others depend on natural datasets [31],[40]. The first studies in this field focused only on detecting image components. Some studies, such as [22], detect components by drawing a bounding box around each one; others [7],[10] locate humans in the images. Studies such as [23] define objects based on faces and bodies. Another study [10] uses the Caltech dataset, which consists of 35000 images and also contains information about the objects and components in each image, without requiring a detection step. Another work [12] uses the Pascal VOC dataset, which includes 20 classes taken from 1100 images. The SUN dataset [13] consists of 908 different classes and 3819 sub-classes; it contains 7971 images of the desk class, 20213 images of the wall class, and 16080 images of the window class, with fewer images for other classes such as boat, plane, ground, and light. The well-known MS COCO dataset [40] defines the properties of image components from the perspective of object detection and labelling, as well as the relationships between image components. Using 200 classes corresponding to 400000 images from the ImageNet dataset, the study [40] detects the locations of 350000 components with the bounding-box method.
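Several of the detection studies above localize components with bounding boxes. The standard way to score how well a predicted box matches a ground-truth box is intersection-over-union (IoU); the sketch below is a generic illustration of that metric, not code from any of the cited works.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) tuples; returns a value in [0, 1].
    """
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.1428...
```

Detection benchmarks such as Pascal VOC typically count a predicted box as correct when its IoU with a ground-truth box exceeds a threshold (commonly 0.5).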