Data 2022, 7, 83. https://doi.org/10.3390/data7070083 www.mdpi.com/journal/data
Article
Instagram‐Based Benchmark Dataset for Cyberbullying Detection in Arabic Text
Reem ALBayari 1,2,* and Sherief Abdallah 2
1 Higher College of Technology, Abu Dhabi P.O. Box 25026, United Arab Emirates
2 Faculty of Engineering and IT, The British University in Dubai, Dubai P.O. Box 345015, United Arab Emirates; sherief.abdallah@buid.ac.ae
* Correspondence: ralbayari@hct.ac.ae; Tel.: +971‐556780224
Abstract: (1) Background: The ability to use social media to communicate without revealing one’s
real identity has created an attractive setting for cyberbullying. Several studies have collected
datasets from social media with the aim of automatically detecting offensive language. However, the
majority of these datasets are in English, not Arabic. Even among the few Arabic datasets that have
been collected, none focuses on Instagram, despite its being a major social media platform in the
Arab world. (2) Methods: We use the official Instagram APIs to collect our dataset. To establish the
dataset as a benchmark, we use SPSS (Kappa statistic) to evaluate the inter‐annotator agreement
(IAA), and we examine and evaluate the performance of various learning models (LR, SVM, RFC, and
MNB). (3) Results: In this research, we present the first Instagram Arabic corpus with sub‐class
(multi‐class) categorization focusing on cyberbullying. The dataset is primarily designed for
detecting offensive language in texts. We end up with 200,000 comments, of which 46,898 were
annotated by three human annotators. The results show that the SVM classifier outperforms the other
classifiers, with an F1 score of 69% for bullying comments and 85% for positive comments.
Keywords: cyberbullying; offensive language; Arabic dialect
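The two measurements named in the abstract, inter‐annotator agreement and per‐class F1, can be illustrated with a minimal Python sketch. It is not the authors' procedure: the paper reports only that SPSS was used to compute a Kappa statistic, so the choice of Fleiss' kappa (a common variant for three annotators) is an assumption, as are the function names, the label names, and the toy data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for items that each received the same number of labels.

    ratings: list of items; each item is a list with one label per annotator.
    Assumption: the paper does not say which kappa variant SPSS computed.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement, derived from the overall label proportions.
    total = n_items * n_raters
    p_expected = sum(
        (sum(c[cat] for c in counts) / total) ** 2 for cat in categories
    )

    if p_expected == 1.0:  # every annotation uses the same single label
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def f1_for_class(y_true, y_pred, target):
    """F1 score for one class, treating `target` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: three annotators labeling four comments.
annotations = [
    ["bullying", "bullying", "bullying"],
    ["positive", "positive", "neutral"],
    ["neutral", "neutral", "neutral"],
    ["bullying", "positive", "bullying"],
]
print("Fleiss' kappa:", fleiss_kappa(annotations))

# Hypothetical gold labels and classifier predictions.
gold = ["bullying", "positive", "bullying", "positive"]
predicted = ["bullying", "bullying", "positive", "positive"]
print("F1 (bullying):", f1_for_class(gold, predicted, "bullying"))
```

Reporting F1 per class, as the abstract does for bullying and positive comments, keeps a weak minority class from being hidden behind a single averaged score.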
1. Introduction
In recent years, the number of social media users has grown dramatically. Facebook, Twitter,
Instagram, and many other platforms give users a place to express their thoughts and interact with
users from different cultures and backgrounds. Unfortunately, this enriching social experience also
provides a fertile environment for cyberbullying [1,2]. Cyberbullying is defined as the use of
telecommunications to disseminate abusive content such as messages, images, or videos with the aim
of causing harm to others [3]. Such a toxic environment can lead to hate crimes and psychological
harm [4], which has created a need for automatic detection of offensive and abusive speech on
social media platforms. Due to the negative effects of cyberbullying [5,6], recent years have seen
an increasing number of studies that collected and annotated cyberbullying datasets from different
social media platforms [7–9]. Since social media data comes from the users’ minds, it provides an
unprecedented opportunity for studying cognitive processes such as perception, personality, and
information spread [10]. For instance, the authors in [11] conducted an online survey and discovered
the granular functional impact of social media in supporting a positive impression of stresses
throughout the pandemic. That study provides an empirically verified theoretical framework for
understanding the emergence of social media buffering mechanisms as a result of collective
resilience.
In addition, social media presents challenges, since it necessitates the development of
appropriate interpretable frameworks that can highlight the structure of
Citation: ALBayari, R.; Abdallah, S. Instagram‐Based Benchmark Dataset for Cyberbullying Detection in Arabic Text. Data 2022, 7, 83. https://doi.org/10.3390/data7070083
Academic Editor: Joaquín Torres‐Sospedra
Received: 21 May 2022
Accepted: 21 June 2022
Published: 22 June 2022
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).