(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 11, 2020

SDCT: Multi-Dialects Corpus Classification for Saudi Tweets

Afnan Bayazed, Ola Torabah, Redha AlSulami, Dimah Alahmadi, Amal Babour, Kawther Saeedi
Information Systems Department, King Abdulaziz University, Jeddah, Saudi Arabia

Abstract—There is an increasing demand for analyzing the contents of social media. However, sentiment analysis of Arabic, and especially of Arabic dialects, can be very complex and challenging. This paper presents the details of collecting and constructing a classified corpus of 4180 multi-dialectal Saudi tweets (SDCT). The tweets were annotated manually by five native speakers in two stages. The first stage annotated each tweet as Hijazi, Najdi, or Eastern, based on Saudi regions. The second stage annotated the sentiment as positive, negative, or neutral. The annotation process was evaluated using the Kappa score. The validation process used cross-validation through eight baseline experiments that trained different classifier models. The results show that 10-fold validation provides greater accuracy than 5-fold validation across the eight experiments, and that classification of the Eastern dialect achieved the best accuracy among the dialects, at 91.48%.

Keywords—Arabic dialects; dialects classification; language classification; natural language processing; Saudi dialects; sentiment analysis; Twitter

I. INTRODUCTION

Today, there are roughly 6500 spoken languages around the world, and each language includes multiple dialects [1]. Arabic is one of the most widely used languages in the world. It is the official language of 22 countries and is spoken by over 400 million people. It is considered the fourth most used language on the Internet [2].
There are three varieties of Arabic: Classical Arabic (CA), Modern Standard Arabic (MSA), and Arabic dialects (AD). CA is the form of Arabic used in literary texts and the Quran (Islam’s Holy Book). MSA is the principal form of Arabic, commonly used in formal conversations, media, education, newspapers, magazines, and formal TV programs. AD is used in informal communication and is divided by geographical region [3]. The AD geographical regions are Egyptian, North African, Levantine, Iraqi, and Gulf [1]. The Gulf region consists of six countries: Saudi Arabia, United Arab Emirates, Qatar, Kuwait, Bahrain, and Oman, each with its own dialect. Within Saudi Arabia, each region also has its own dialect: Hijazi in the western region, Najdi in the central region, the Southern dialect in the southern region, the Northern dialect in the northern region, and the Eastern dialect in the eastern region. The differences among the Arabic dialects are large enough that they can be considered different languages; therefore, Arabic and its dialects require further intensive study and analysis.

Most Arabic natural language processing (NLP) applications, such as sentiment analysis, machine translation, speech recognition, and speech synthesis, are dedicated to MSA. Moreover, Arabic NLP tools such as part-of-speech (POS) tagging, morphological analysis, and disambiguation are designed specifically for MSA, and therefore give less accurate results for AD. Arabic NLP resources focus on MSA, which covers all orthographic varieties and has a rich morphology and a strong syntactic system. These resources do not cover AD as well as they cover MSA. Furthermore, the Arabic dialects are spoken languages with no writing system. Creating resources for the Arabic dialects is therefore a challenging but necessary task in Arabic NLP [4-6].
The growing volume of textual data on social media websites and microblogs, such as Twitter and Facebook, provides a huge resource for the Arabic dialects. Social media is an important communication tool that people use to write about their daily lives, share information, post reviews or opinions, explore the latest news, and search for real-time news events. Arabic users tend to communicate with each other in unstructured, ungrammatical Arabic slang. Twitter is one of the world’s most popular platforms for internet users. Twitter users send about 500 million tweets per day, and each tweet contains up to 280 characters [7]. Arab people have been influenced by the recent evolution in technology. There are more than 11 million Arabic users on Twitter, who post 27.4 million tweets per day. The most active users are from Saudi Arabia, accounting for about 30% of all the tweets [8].

Al-Twairesh et al. [9] claim that the lack of Arabic corpora is one of the challenges facing Arabic sentiment analysis. Accordingly, this research aims to utilize the huge volume of Arabic textual data and prepare it as a language resource for the Saudi dialects. This paper’s contributions can be summarized as follows:

• Build a Saudi dialects corpus from Twitter, called SDCT, and make it available as an open-source resource for the research community.
• Classify SDCT according to the different Saudi dialects (Hijazi, Najdi, and Eastern).
• Provide sentiment labelling of each dialect as Positive, Negative, or Neutral.

The paper is organized as follows: The previous related work is described in Section II. Section III explains the