(IJCSIS) International Journal of Computer Science and Information Security, Vol. 14, No. 3, March 2016

Current Moroccan Trends in Social Networks

Abdeljalil EL ABDOULI, Abdelmajid CHAFFAI, Larbi HASSOUNI, Houda ANOUN, Khalid RIFI
RITM Laboratory, CED Engineering Sciences, Ecole Superieure de Technologie, Hassan II University of Casablanca, Morocco

Abstract— The rapid development of social networks during the past decade has led to the emergence of new forms of communication and new platforms like Twitter and Facebook. These are the two most popular social networks in Morocco. Therefore, analyzing these platforms can help in interpreting the current trends of Moroccan society. However, this comes with a few challenges. First, Moroccans use multiple languages and dialects in their daily communication, such as Standard Arabic, Moroccan Arabic (called “Darija”), the Moroccan Amazigh dialect (called “Tamazight”), French, and English. Second, Moroccans use reduced syntactic structures and unorthodox lexical forms, with many abbreviations, URLs, #hashtags, and spelling mistakes. In this paper, we propose a detection engine for Moroccan social trends that extracts the data automatically and stores it in a distributed system, the Hadoop framework, using the HDFS storage model. We then process and analyze this data with a distributed program written as a Pig UDF in Python, using Natural Language Processing (NLP) as the linguistic technique and Latent Dirichlet Allocation (LDA) for topic modeling. Finally, the results are visualized using pyLDAvis and WordCloud, and exploratory data analysis is carried out using hierarchical clustering and other analysis methods.

Keywords: distributed system; Hadoop framework; Pig UDF; Natural Language Processing; Latent Dirichlet Allocation; topic modeling; pyLDAvis; WordCloud; exploratory data analysis; hierarchical clustering.

I. INTRODUCTION

Twitter and Facebook, which have become part of people's connected lives, are considered the most popular social platforms. According to the latest statistics, Facebook alone has 936 million daily active users, 83% of them outside the USA [1], and Twitter has 316 million monthly active users, 77% of them outside the USA [2]. In some countries, these two platforms have grown very fast. For instance, in Morocco, the country our research focuses on, Facebook gained 590,000 Moroccan users between January and October 2011 [3]. Meanwhile, the number of Moroccan user accounts on Twitter reached 26,666 and their number of tweets reached 780,000 per month, ranking third in the Arab world [3].

As for the services that made these two platforms popular, Twitter is user-friendly and allows users to write short messages, called “tweets”, limited to 140 characters, in which they can post links or share images. Facebook, on the other hand, allows users to create personal profiles, add other users as friends, and exchange messages, including status updates. Users can also share photos, links, and personal thoughts.

These statistics encouraged us to conduct a study that analyzes the messages published by Moroccan users on these two platforms, despite the difficulties mentioned above. In this paper, we propose a detection engine for Moroccan social trends that handles the data generated by Moroccan users to create a text corpus, which is then used to analyze and visualize the trends of Moroccan society over a chosen period.

To build our detection engine, we rely on the Hadoop framework, a distributed system commonly used to build an infrastructure for storage and processing [4]. This infrastructure is composed of four parts.
The first part is data extraction, which handles the streaming of data related to Moroccan society from both Twitter and Facebook and stores it in our distributed system using the HDFS storage model [5]. The second part handles the processing of the collected data. It starts by converting the JSON-formatted tweets using JAQL (JSON Query Language) [6], and then searches for the pertinent information contained in this data. For Facebook, we use an API wrapper written in Java to handle the JSON format and extract the posts and comments directly from the Facebook Graph API. We then apply a distributed program based on Natural Language Processing [22] to this data by running a Pig UDF [7] written in Python. The third part uses the previous result to generate the LDA corpus [14]. The fourth part is composed of visualization tools, such as pyLDAvis and WordCloud, and other exploratory data analysis methods such as hierarchical clustering.

This paper is organized as follows. In Section II, we introduce some related work. In Section III, we present the tools and methods used in our system. In Section IV, we describe the architecture of the detection engine of Moroccan social trends. In Section V, we run an experiment. We end with a conclusion in Section VI.

II. RELATED WORK

The analysis of social network platforms has been the focus of much research. For example, H. Kwak, C. Lee, H. Park and S. Moon [8] found that Twitter users sometimes broadcast news before traditional media. In 2011, J. Weng and B.-S. Lee [9] proposed a method based on the daily frequency of terms in the corpus, in which the frequency of each term is represented as a signal. Moreover, O. Ozdikis, P. Senkul and H. Oguztuzun [10] performed a semantic expansion of the terms present in the tweets. Finally, H. Becker, M. Naaman and L.
Gravano [11], proposed an approach to identify, among