Jurnal Teknologi Informasi dan Terapan (J-TIT) Vol. 7 No. 1 January-June 2020 ISSN: 2580-2291 DOI: https://doi.org/10/25047/jtit.v7i1.135 ©2020 JTIT

Building Related Words in Indonesian and English Translation of Al-Qur'an Vocabulary Based on Distributional Similarity

Rahmad Geri Kurniawan
Bachelor of Informatics, School of Computing, Telkom University, Bandung, West Java
gerikurniawan@student.telkomuniversity.ac.id

Moch Arif Bijaksana
Bachelor of Informatics, School of Computing, Telkom University, Bandung, West Java
arifbijaksana@telkomuniversity.ac.id

Abstract—The Qur'an is the Muslim holy book and the primary source of knowledge and guidance. It consists of 114 surahs and 30 juz and contains approximately 6200 verses. Finding and summarizing connections or similarities between words in the Qur'an by hand takes a long time. There is therefore a need for a dictionary, encyclopedia, or thesaurus of Al-Qur'an vocabulary in which each word entry is linked to related words. This study examines the interrelations and semantic similarities between words in the Qur'an, with the aim of supporting the search for related words. The approach taken is distributional similarity, which is an important part of word embedding. Word relatedness is measured by semantic similarity, a topic studied in Natural Language Processing (NLP). Semantic similarity is computed as the closeness of word vectors using cosine similarity. Words are converted into vectors with the FastText algorithm, which is an extension of the Word2vec algorithm. The dataset consists of the English and Indonesian translations of the words of the Al-Qur'an. Words are given as input to the system, which then produces a score representing how strongly they are related. The system output is evaluated with the Pearson correlation method against a gold standard.
With the FastText algorithm, the evaluation yields a correlation value of 0.3398 for the Indonesian translation corpus and 0.2326 for the English translation corpus.

Keywords— Quran, semantic similarity, word embedding, FastText, Pearson correlation

INTRODUCTION

The Qur'an is the holy book of Islam, which serves as the primary source of knowledge, law, wisdom, and guidance for Muslims. The Qur'an consists of 114 surahs, 30 juz, and 6217 verses according to the narration of Abl Medina, 6210 verses according to al-Dani's narration, or 6214 verses according to Warsy's narration [1]. The Qur'an contains a great deal of information, and words with related meanings are scattered throughout it. One way to understand the Qur'an is to explain the content of its verses from various aspects while paying attention to the order in which the verses appear [2]. Finding similarities and links between words also helps to explain the content of the Qur'anic verses.

Such similarities and links between words belong to one of the research areas of Natural Language Processing (NLP), namely semantic similarity. This field deals with measuring the similarity of two words, represented by the similarity between their related concepts. The idea of semantic similarity is to identify concepts that have the same 'characteristics'. Semantic similarity is understood as the level of taxonomic closeness between concepts (or terms, words). In other words, semantic similarity states how taxonomically close two concepts (or terms, words) are, because they share several aspects of their meaning. Technically, similarity measures assign a numerical score that quantifies this closeness as a function of the semantic evidence observed in one or several sources of knowledge [3]. In practice, for example, giving the system "paradise" as the first input word and "hereafter" as the second input word should produce a high similarity value.
As a human reader would interpret them, both words refer to a place of life after life in this world.

Research on semantic similarity is still ongoing and uses various methods, among them Word2vec, Global Vectors (GloVe), and Support Vector Machine (SVM). A previous study on distributional similarity measured the relatedness of words in Arabic using a vector-based approach. The system built in that study uses the Word2vec model to produce, for a given word, a set of related words. The study was evaluated by calculating precision, based on corrections that linguists made to the system output [4]. Word2vec is known to ignore morphology, so such methods cannot create word vectors for new words that do not appear in the training data. Because the morphological features of words are ignored, a vector for a new word cannot be derived from morphologically similar words [5].

In this study, a system was built to calculate the semantic similarity value of two input words. We use the distributional similarity approach to capture the semantic similarity of words and to form groups of similar words.
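The out-of-vocabulary limitation described above, and the subword idea that FastText uses to address it, can be illustrated with a minimal Python sketch. This is not the system built in this study. It assumes FastText-style character n-grams of length 3 to 6 wrapped in `<` and `>` boundary markers, and it replaces the learned n-gram vectors with deterministic pseudo-random placeholders (the hypothetical helper `ngram_vector`). The names `char_ngrams`, `subword_vector`, `word_table`, and the vector size `DIM` are illustrative assumptions, as is the unseen word "paradises".

```python
import math
import random
import zlib

DIM = 300  # vector size; an arbitrary choice for this sketch


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


def ngram_vector(gram):
    """Placeholder for a learned n-gram vector (assumption: untrained,
    pseudo-random values seeded deterministically from the n-gram text)."""
    rng = random.Random(zlib.crc32(gram.encode("utf-8")))
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def subword_vector(word):
    """Average the n-gram vectors, so an unseen word still gets a vector."""
    vectors = [ngram_vector(g) for g in char_ngrams(word)]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Whole-word lookup table: only words seen during training have an entry.
word_table = {w: subword_vector(w) for w in ("paradise", "hereafter")}

unseen = "paradises"
print(unseen in word_table)  # False: a whole-word table has no entry for it

# The subword route still yields a vector for the unseen word.
unseen_vec = subword_vector(unseen)
sim_related = cosine_similarity(unseen_vec, word_table["paradise"])
sim_unrelated = cosine_similarity(unseen_vec, word_table["hereafter"])
print(f"paradises vs paradise:  {sim_related:.3f}")
print(f"paradises vs hereafter: {sim_unrelated:.3f}")
```

Because the placeholder vectors are untrained, the scores here reflect only shared character n-grams: "paradises" lands close to "paradise", while "hereafter" does not, even though a human would consider it related. Capturing that semantic relation requires n-gram vectors trained on a corpus, which is the role of the FastText model in this study.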