Data 2022, 7, 83. https://doi.org/10.3390/data7070083 www.mdpi.com/journal/data

Article

Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text

Reem ALBayari 1,2,* and Sherief Abdallah 2

1 Higher College of Technology, Abu Dhabi P.O. Box 25026, United Arab Emirates
2 Faculty of Engineering and IT, The British University in Dubai, Dubai P.O. Box 345015, United Arab Emirates; sherief.abdallah@buid.ac.ae
* Correspondence: ralbayari@hct.ac.ae; Tel.: +971556780224

Abstract: (1) Background: The ability to use social media to communicate without revealing one’s real identity has created an attractive setting for cyberbullying. Several studies have collected datasets from social media with the aim of automatically detecting offensive language. However, the majority of these datasets are in English, not Arabic. Even among the few Arabic datasets that have been collected, none focuses on Instagram, despite its being a major social media platform in the Arab world. (2) Methods: We use the official Instagram APIs to collect our dataset. To establish the dataset as a benchmark, we use SPSS (Kappa statistic) to evaluate the inter-annotator agreement (IAA), and we examine and evaluate the performance of various learning models (LR, SVM, RFC, and MNB). (3) Results: In this research, we present the first Instagram Arabic corpus, with multiclass (subclass) categorization, focusing on cyberbullying. The dataset is primarily designed for detecting offensive language in text. We end up with 200,000 comments, of which 46,898 were annotated by three human annotators. The results show that the SVM classifier outperforms the other classifiers, with an F1 score of 69% for bullying comments and 85% for positive comments.

Keywords: cyberbullying; offensive language; Arabic dialect

1. Introduction

In recent years, the number of social media users has grown dramatically.
Facebook, Twitter, Instagram, and many other platforms provide a place for users to express their thoughts and interact with users from different cultures and backgrounds. Unfortunately, this enriching social experience also provides a fertile environment for cyberbullying [1,2]. Cyberbullying is defined as the use of telecommunications to disseminate abusive content such as messages, images, or videos with the aim of causing harm to others [3]. Such a toxic environment can lead to hate crimes and psychological harm [4], which has created a need for automatic detection of offensive and abusive speech on social media platforms. Due to the negative effects of cyberbullying [5,6], recent years have seen an increasing number of studies that collected and annotated cyberbullying datasets from different social media platforms [7–9]. Since social media data comes from the users’ minds, it provides an unprecedented opportunity for studying cognitive processes such as perception, personality, and information spread [10]. For instance, the authors in [11] conducted an online survey and discovered the granular functional impact of social media in supporting a positive impression of stresses throughout the pandemic. That study provides an empirically verified theoretical framework for understanding the emergence of social media buffering mechanisms as a result of collective resilience.

In addition, social media presents challenges, since it necessitates the development of appropriate interpretable frameworks that can highlight the structure of

Citation: ALBayari, R.; Abdallah, S. Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text. Data 2022, 7, 83.
https://doi.org/10.3390/data7070083

Academic Editor: Joaquín Torres-Sospedra

Received: 21 May 2022
Accepted: 21 June 2022
Published: 22 June 2022

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).