Journal of Information Systems Engineering and Business Intelligence
Vol.6, No.1, April 2020
Available online at: http://e-journal.unair.ac.id/index.php/JISEBI
ISSN 2443-2555 (online) 2598-6333 (print) © 2020 The Authors. Published by Universitas Airlangga.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
doi: http://dx.doi.org/10.20473/jisebi.6.1.9-17
Lexicon-Based Indonesian Local Language Abusive Words Dictionary to Detect Hate Speech in Social Media
Mardhiya Hayaty 1)*, Sumarni Adi 2), Anggit D. Hartanto 3)
1)2)3) Faculty of Computer Science, Universitas Amikom Yogyakarta, Indonesia
Ring Road Utara Condong Catur, Sleman
1) mardhiya_hayati@amikom.ac.id, 2) sumarni.a@amikom.ac.id, 3) anggit@amikom.ac.id
Article history:
Received 15 January 2020
Revised 5 March 2020
Accepted 12 March 2020
Available online 28 April 2020
Keywords:
Abusive Words
Dictionary base
Hate Speech
Lexicon base
Abstract
Background: Hate speech is an expression directed at a person or a group of people that
conveys hatred and/or anger toward them. On social media, users are free to express
themselves by writing harsh words and sharing them with a group of people, which can
trigger division and conflict between groups. Researchers currently detect hate speech in
social media using two approaches, machine learning-based and lexicon-based. The machine
learning approach has a weakness: the manual labelling process, in which an annotator
separates positive, negative, and neutral opinions, is time-consuming and tiring.
Objective: This study aims to produce a dictionary containing abusive words from local
languages in Indonesia. The lexicon-based approach depends heavily on the language covered
by the words in the dictionary. Indonesia has thousands of tribes with 2500 local languages,
and 80% of the population of Indonesia use local languages in communication, which makes
detecting hate speech in social media a significant challenge.
Methods: Surveys of abusive words were conducted using a proportionate stratified random
sampling technique among 4 major tribes on the island of Java, namely Betawi, Sundanese,
Javanese, and Madurese.
Results: The experiment produced a dictionary of 250 abusive words from 4 major
Indonesian tribes for detecting hate speech in Indonesian social media using the
lexicon-based approach.
Conclusion: A stratified random sampling technique was applied to 4 major
Indonesian tribes to produce 250 abusive words for hate speech detection using the
lexicon-based approach.
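The proportionate stratified random sampling named in the Methods can be sketched as follows. This is a minimal illustration, not the authors' survey procedure: the stratum names, the stratum sizes, and the helper functions `proportionate_allocation` and `stratified_sample` are hypothetical, and the largest-remainder rounding rule is an assumption made so that the allocations sum to the requested sample size.

```python
import random


def proportionate_allocation(strata_sizes, total_sample):
    """Split total_sample across strata in proportion to each stratum's size.

    Fractional shares are rounded with the largest-remainder rule
    (an assumption of this sketch) so the allocations sum to total_sample.
    """
    population = sum(strata_sizes.values())
    exact = {name: total_sample * size / population
             for name, size in strata_sizes.items()}
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = total_sample - sum(alloc.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - alloc[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


def stratified_sample(strata, total_sample, seed=0):
    """Draw a simple random sample inside each stratum, sized proportionately."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    alloc = proportionate_allocation(sizes, total_sample)
    return {name: rng.sample(strata[name], alloc[name]) for name in strata}


# Hypothetical strata: respondent IDs grouped by stratum (made-up sizes).
example_strata = {
    "stratum_a": list(range(50)),
    "stratum_b": list(range(30)),
    "stratum_c": list(range(20)),
}
example_sample = stratified_sample(example_strata, total_sample=10)
```

Each stratum contributes respondents in proportion to its share of the population, so larger groups are not under-represented in the survey.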
I. INTRODUCTION
Hate speech is speech directed at a person or a group of people that conveys hate or anger toward them [1].
Language is related [2] to the strength of the arguments that shape a person's opinion, so it can help predict the
onset of social conflict. At present, people depend heavily on internet connections, especially for social
media [3], where provocation spreads very easily and can influence someone to commit illegal acts.
The number of internet users in Indonesia increases every year. According to the Indonesian Service User
Association survey in 2018, Indonesia has 143 million active internet users. On social media, users express
themselves freely and write bad, insulting, offensive, or hateful words. Words or sentences shared with a
specific group or with individuals can trigger hatred and division.
In sentiment analysis, hate speech is a negative sentiment. Algorithms such as Support Vector Machine,
Naive Bayes, and Random Forest Decision Tree can be used for opinion classification and analysis [5]. However,
detecting hate speech is not only a matter of word-to-word matching, because every language has its own
informal forms and grammar.
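The word-to-word matching discussed above, and its limitation, can be illustrated with a minimal sketch. The lexicon entries here are hypothetical placeholders, not words from the study's dictionary, and the function names `tokenize`, `find_abusive_words`, and `is_flagged` are introduced only for illustration.

```python
import re

# Hypothetical placeholder entries; the study's dictionary words are not listed here.
ABUSIVE_LEXICON = {"badword", "slurword"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def find_abusive_words(text, lexicon=ABUSIVE_LEXICON):
    """Return the tokens of text that appear in the lexicon, in order."""
    return [token for token in tokenize(text) if token in lexicon]


def is_flagged(text, lexicon=ABUSIVE_LEXICON):
    """Flag a text when it contains at least one lexicon entry."""
    return bool(find_abusive_words(text, lexicon))
```

Exact matching of this kind flags "You are a BADWORD" but misses an informal spelling such as "baaadword", which is why coverage depends on the forms that the dictionary actually contains.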
Indonesia has 1340 tribes. The Javanese, who make up 40% of the Indonesian population, are spread across
almost every territory of Indonesia. In addition, Indonesia has many local languages: there are 2500 of them.
* Corresponding author