Jurnal Teknologi Informasi dan Terapan (J-TIT) Vol. 7 No. 1 Januari-Juni 2020 ISSN: 2580-2291
DOI: https://doi.org/10/25047/jtit.v7i1.135 ©2020 JTIT
Building Related Words in Indonesian and
English Translation of Al-Qur’an Vocabulary
Based on Distributional Similarity
Rahmad Geri Kurniawan
Bachelor of Informatics
School of Computing
Telkom University
Bandung, West Java
gerikurniawan@student.telkomuniversity.ac.id
Moch Arif Bijaksana
Bachelor of Informatics
School of Computing
Telkom University
Bandung, West Java
arifbijaksana@telkomuniversity.ac.id
Abstract—The Qur'an is the Muslim holy book and the primary
source of knowledge and guidance. It consists of 114 surahs and
30 juz and contains approximately 6200 verses. Finding and
summarizing connections or similarities between words in the
Qur'an takes a long time. There is therefore a need for a
dictionary, encyclopedia, or thesaurus of Qur'anic vocabulary
in which each word entry is linked to related words. This study
examines the relatedness and semantic similarity of words in
the Qur'an, with the aim of helping readers search for related
words. The approach taken is distributional similarity, an
important part of word embedding. Word relatedness is measured
by semantic similarity, a topic studied in Natural Language
Processing (NLP). Semantic similarity is computed as the
closeness of word vectors using cosine similarity. Words are
converted to vectors with the FastText algorithm, which is a
development of the Word2vec algorithm. The dataset consists of
English and Indonesian translations of the Qur'an. Words are
given to the system as input, and the system produces a score
that represents how related they are. The system output is
evaluated against a gold standard using the Pearson
correlation. With the FastText algorithm, the correlation is
0.3398 for the Indonesian translation corpus and 0.2326 for the
English translation corpus.
Keywords— Qur'an, semantic similarity, word embedding,
FastText, Pearson correlation
INTRODUCTION
The Qur'an is the holy book of Islam and serves as the
primary source of knowledge, law, wisdom, and guidance for
Muslims. The Qur'an consists of 114 surahs, 30 juz, and 6217
verses according to the narration of Ahl Medina, 6210 verses
according to al-Dani's narration, or 6214 verses according to
Warsy's narration [1]. The Qur'an contains a great deal of
information, and words with related meanings are scattered
throughout it. One way to understand the Qur'an is to explain
the content of its verses from various aspects, paying
attention to the sequence of the verses as stated in it [2].
Finding similarities and linkages between words also helps to
explain the contents of the Qur'anic verses.
Word similarity and relatedness belong to one of the areas
of Natural Language Processing (NLP), namely semantic
similarity. This field measures the similarity of two words
through the similarity of the concepts they represent. The
idea of semantic similarity is to identify concepts that have
the same 'characteristics'. Semantic similarity is understood
as the degree of taxonomic closeness between concepts (or
terms, words). In other words, semantic similarity states how
close two concepts (or terms, words) are taxonomically,
because they share several aspects of their meaning.
Technically, similarity measures assign numerical scores
that quantify this closeness as a function of the semantic
evidence observed in one or several sources of knowledge
[3]. For example, given "paradise" as the first input word
and "hereafter" as the second, the system produces a high
similarity value, because a human reader interprets both
words as referring to a place of life after life in this
world. Research on semantic similarity continues to be
carried out with various methods, some of which are
Word2vec, Global Vectors (GloVe), and Support Vector
Machine (SVM).
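The abstract states that the closeness of word vectors is measured with cosine similarity. The following is a minimal sketch of that measure, not the authors' implementation. The three-dimensional vectors and the word "camel" are invented purely for illustration; real FastText vectors are learned from the corpus and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (assumption: values made up, not trained embeddings).
toy_vectors = {
    "paradise": [0.9, 0.8, 0.1],
    "hereafter": [0.8, 0.9, 0.2],
    "camel": [0.1, 0.2, 0.9],
}

if __name__ == "__main__":
    for other in ("hereafter", "camel"):
        score = cosine_similarity(toy_vectors["paradise"], toy_vectors[other])
        print(f"paradise vs {other}: {score:.4f}")
```

With these toy values, "paradise" scores higher against "hereafter" than against "camel", which mirrors the behaviour described for related word pairs.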
A previous study on distributional similarity measured
the relatedness of words in Arabic using a vector-based
approach. The system built in that study uses the Word2vec
model to produce a set of words related to a given word. It
was evaluated by calculating precision based on the
corrections that linguists made to the system output [4].
Word2vec and similar methods are known to ignore morphology,
so they cannot create word vectors for new words that do not
appear in the training data. Because the morphological
features of words are ignored, a vector for a new word
cannot be obtained by comparing it with morphologically
similar words [5].
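To illustrate how morphology can be taken into account, the sketch below extracts character n-grams in the way FastText is commonly described as doing: the word is wrapped in the boundary markers `<` and `>` and every substring of length 3 to 6 is collected. This is a simplified illustration under those assumptions, not the authors' code. It omits the n-gram hashing and the training of n-gram vectors, and the helper names `char_ngrams` and `shared_ngrams` are hypothetical.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """List the character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def shared_ngrams(word_a, word_b, n_min=3, n_max=6):
    """Return the set of n-grams that two words have in common."""
    grams_a = set(char_ngrams(word_a, n_min, n_max))
    grams_b = set(char_ngrams(word_b, n_min, n_max))
    return grams_a & grams_b


if __name__ == "__main__":
    print(char_ngrams("where", 3, 3))
    print(sorted(shared_ngrams("believer", "believers")))
```

Because a word that never occurred in the training data still shares n-grams with morphologically similar words that did, a subword model can build a vector for it from those shared pieces, which a purely word-level model cannot do.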
In this study, a system was built to calculate the
semantic similarity value of two input words. We use the
distributional similarity approach to capture the semantic
similarity of words and to group words that are similar.