Journal of Theoretical and Applied Information Technology
29th February 2020. Vol.98. No 04
© 2005 – ongoing JATIT & LLS
ISSN: 1992-8645 www.jatit.org E-ISSN: 1817-3195
612
SENTIMENT ANALYSIS FOR ARABIC TWEETS DATASETS: LEXICON-BASED AND MACHINE LEARNING APPROACHES
AHMAD ALOQAILY¹, MALAK AL-HASSAN², KAMAL SALAH³, BASIMA ELSHQEIRAT⁴, MONTAHA ALMASHAGBAH⁵
¹,⁵ Prince Al Hussein Bin Abdullah II Faculty for Information Technology, Hashemite University,
P.O. Box 150459, Zarqa 13115, Jordan.
²,⁴ King Abdullah II School of Information Technology, The University of Jordan,
P.O. Box 11942, Amman, Jordan.
³ Deanship of Preparatory Year and Supporting Studies, Imam Abdulrahman Bin Faisal University,
P.O. Box 1982, Dammam, Saudi Arabia.
E-mail: ¹aloqaily@hu.edu.jo, ²m_alhassan@ju.edu.jo, ³kisalah@iau.edu.sa, ⁴b.shoqurat@ju.edu.jo, ⁵montaha.mashagbah@yahoo.com
ABSTRACT
Sentiment Analysis applied to social media data has become a significant research interest in the data
mining domain due to the large volume of data available on social media networks. Sentiment Analysis is
concerned with analyzing text to identify opinions or emotions and categorizing them as positive, negative
or neutral. Applying sentiment analysis to short texts such as Twitter messages is challenging because
tweets may combine formal and informal language, special characters, emojis and symbols. Therefore, it is
often difficult to understand the semantics of the text and to extract the emotions expressed by users.
In this paper, two sentiment analysis approaches, lexicon-based and machine learning, are applied and
evaluated on a dataset of Arabic tweets (short texts) regarding the Syrian civil war and crises. The
experimental results revealed that the machine learning approaches outperformed the lexicon-based approach
in predicting the subjectivity of tweets. Five popular machine learning algorithms were applied and
evaluated. According to the experimental results, the Logistic Model Trees (LMT) algorithm achieved the
highest performance, followed by the simple logistic and the SVM algorithms, respectively. The results also
showed that performance improves when feature selection approaches are used. Across all performance
evaluation measures, the LMT algorithm reported the best results (Acc = 85.55, F1 = 0.92 and AUC = 0.86).
Keywords: Machine Learning; Lexicon-Based Approach; Sentiment Analysis; Opinion Mining; Social
Media; Twitter Datasets.
1 INTRODUCTION
Nowadays, the Internet has become a valuable
source of information, events, news and
opinions, much of it available on social media websites such as
Twitter and Facebook. Currently, Twitter has more
than 330 million monthly active users [1]. Through
Twitter, people can express their opinions and
feelings, companies can get their clients’ feedback
and politicians can stay in touch with their constituents
and increase the number of their supporters [2]. With
the availability of such abundant data, investigating
people’s views and opinions has become
more accessible and feasible.
Consequently, there is a pressing need to
process, analyze and eventually extract knowledge
from such data in the form of opinions concerning significant issues,
entities or topics. Analyzing Twitter data, for
example, is not a trivial task and depends on the
semantics of tweets, which are short, concise texts
(maximum 140 characters). This type of analysis is
called Sentiment Analysis (SA) or Opinion Mining
(OM). As stated in [3], sentiment is defined as "an
attitude, thought, or judgment prompted by feeling".