Building an Arabic Social Corpus for Dangerous Profile Extraction on Social Networks

Amal Rekik 1,2, Hanen Ameur 1,2, Amal Abid 1,2, Atika Mbarek 1,2, Wafa Kardamine 2, Salma Jamoussi 1,2, Abdelmajid Ben Hamadou 1,2

1 Multimedia InfoRmation Systems and Advanced Computing Laboratory (MIRACL), Tunisia
2 Digital Research Center of Sfax (DRCS), Tunisia

{rekik.amal91, ameurhanen, abidamal90, mbarek.atika91, jamoussi}@gmail.com, wafabensaid2010@hotmail.com, abdelmajid.benhamadou@isimsf.rnu.tn

Abstract. Social networks are today considered revolutionary communication tools that have a tremendous impact on our lives. However, these tools can be exploited by malicious users, notably terrorists. Collecting and analyzing such profiles is a challenging task for which no well-established process exists yet. In this paper, we propose a new method for extracting and annotating data about users on social networks who are suspected of threatening national security. Our method allows the construction of a rich Arabic corpus designed for detecting terrorist users active on social networks. The annotation of our corpus follows a set of rules defined by a domain expert. All these steps are described in detail, and some typical examples are given. We also report statistics from the data collection and annotation stages, as well as an evaluation of the annotated features based on the agreement measured between different experts.

Keywords. Data collection, annotation guidelines, social networks, suspicious content, terrorist users, Arabic social corpus.

1 Introduction

Social media have invaded our daily lives, providing easy tools for users all over the world to express their personal opinions and exchange information. With their ability to reach a large number of internet users, these networks are a tremendous means of communication.
However, this power of communication can easily turn destructive in the presence of malicious profiles; in other words, its use can go beyond the simple exchange of information to become a means of propaganda and of recruiting jihadists around the world. Malicious users on social networks can use short messages containing suspicious words to target specific events. Thus, several attacks can be planned through suspicious profiles that aim to disseminate a particular agenda by creating groups that adhere to their networks. For these reasons and more, this field now attracts many researchers who try to tackle this challenging issue by mining social data [1, 2].

In the literature, very few studies have dealt with such terrorist data [3, 4], mainly due to the lack of resources such as abnormal profile information and labeled corpora. This field is still in its early stages and has not yet been well established. Moreover, although terrorist content shared on social networks is more likely to be in Arabic than in other languages, Arabic sources remain largely unaddressed in the literature. Indeed, collecting this kind of data is very difficult, since terrorist users often try to trick others and conceal their malicious intents. To detect such users, researchers generally rely on intelligent tools, which require well-annotated data. Hence, it is very important to collect and annotate data following a set of rules defined by a domain expert.

Computación y Sistemas, Vol. 22, No. 4, 2018, pp. 1337–1346 doi: 10.13053/CyS-22-4-3068 ISSN 2007-9737

In this context, we propose a new methodology for collecting and annotating suspicious textual