Journal of Theoretical and Applied Information Technology
29th February 2020. Vol. 98. No. 04
© 2005 – ongoing JATIT & LLS
ISSN: 1992-8645 www.jatit.org E-ISSN: 1817-3195

SENTIMENT ANALYSIS FOR ARABIC TWEETS DATASETS: LEXICON-BASED AND MACHINE LEARNING APPROACHES

AHMAD ALOQAILY 1, MALAK AL-HASSAN 2, KAMAL SALAH 3, BASIMA ELSHQEIRAT 4, MONTAHA ALMASHAGBAH 5

1, 5 Prince Al Hussein Bin Abdullah II Faculty for Information Technology, Hashemite University, P.O. Box 150459, Zarqa 13115, Jordan.
2, 4 King Abdullah II School of Information Technology, The University of Jordan, P.O. Box 11942, Amman, Jordan.
3 Deanship of Preparatory Year and Supporting Studies, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam, Saudi Arabia.

E-mail: 1 aloqaily@hu.edu.jo, 2 m_alhassan@ju.edu.jo, 3 kisalah@iau.edu.sa, 4 b.shoqurat@ju.edu.jo, 5 montaha.mashagbah@yahoo.com

ABSTRACT

Sentiment Analysis applied to social media data has gradually become one of the significant research interests in the data mining domain, owing to the large volume of data available on social media networks. Sentiment Analysis is concerned with analyzing text to identify opinions or emotions and categorizing them as positive, negative or neutral. Applying sentiment analysis to short texts such as Twitter messages is challenging because tweets may combine formal and informal language, special characters, emojis and symbols. It is therefore often difficult to understand the semantics of the text and to extract the emotions that users actually express. In this paper, two sentiment analysis approaches, namely the lexicon-based approach and the machine learning approach, are applied and evaluated on a dataset of Arabic tweets (short texts) about the Syrian civil war and crisis. The experimental results revealed that the machine learning approaches outperformed the lexicon-based approach in predicting the subjectivity of tweets.
In terms of machine learning, five popular machine learning algorithms were applied and evaluated. According to the experimental results, the Logistic Model Trees (LMT) algorithm achieved the highest performance, followed by the simple logistic and the SVM algorithms, respectively. The results also showed that performance improves when feature selection approaches are used. Across all performance evaluation measures, the LMT algorithm reported the best results (Acc = 85.55, F1 = 0.92 and AUC = 0.86).

Keywords: Machine Learning; Lexicon-Based Approach; Sentiment Analysis; Opinion Mining; Social Media; Twitter Datasets.

1 INTRODUCTION

Nowadays, the Internet has become a valuable source of information, events, news and opinions, much of it available on social media websites such as Twitter and Facebook. Currently, Twitter has more than 330 million monthly active users [1]. Through Twitter, people can express their opinions and feelings, companies can collect their clients' feedback, and politicians can stay in touch with their constituents and increase the number of their supporters [2]. With such abundant data available, investigating people's views and opinions has become more accessible and feasible. Consequently, there is a pressing need to process and analyze this data and, eventually, to extract knowledge from it in the form of opinions about significant issues, entities or topics. Analyzing Twitter data, for example, is not a trivial task; it depends on the semantics of tweets, which are short, concise texts (maximum 140 characters). This type of analysis is called Sentiment Analysis (SA) or Opinion Mining (OM). As stated in [3], sentiment is defined as "an attitude, thought, or judgment prompted by feeling".