Word Sense Disambiguation for English Quranic IR System

Sabrina Tiun, Hamed Zakr, Masnizah Mohd
Faculty of Technology and Information Science
Universiti Kebangsaan Malaysia
Bangi, Malaysia
sabrinatiun@ftsm.ukm.my, mas@ftsm.ukm.my

Norazlinda Zainal Abidin, Ahmad Irfan Ikmal Hisham
Centre for Modern Language and Human Sciences
Universiti Malaysia Pahang
Kuantan, Malaysia
azlinda@ump.edu.my, irfan@ump.edu.my

Abstract: Information Retrieval (IR) over Quranic text is demanding and far from trivial, because users do not always enter the exact keywords needed to retrieve the relevant Quranic text (verse). Many researchers have tried to overcome this problem by expanding or reformulating the user's query with semantic approaches that draw on resources such as ontologies and thesauri. Word Sense Disambiguation (WSD) has attracted less interest in the IR research community because its reported impact on IR performance has been insignificant or very small. Recently, researchers have shown renewed interest in applying WSD to IR, based on the intuition that deeper semantic analysis of the query will improve IR performance. However, we have not seen any article that mentions the use of WSD for Quranic IR, so we assume that little or no research on WSD for Quranic IR has been carried out. This motivates us to explore the impact of WSD on Quranic IR performance. This paper describes our ongoing project on building an English Quranic WSD. The project is still at the proposal stage, and we lay out what could be the best approach, resources and disambiguation algorithm for Quranic WSD for IR.

Keywords: Unsupervised Word Sense Disambiguation; Quranic Translated Text; Quranic IR

I. INTRODUCTION

Al-Quran is the Holy book that contains the teachings of Islam, in which the main principles of Islam and how these principles should be practised are written.
The availability of digitalised translations of the Quran makes finding written knowledge in the Quran faster and less complicated, especially for those who are not familiar with the Arabic language. Digitalised translated Qurans are available on the Internet through websites such as Islamicity.com and Tafsir.com, and more than 100 websites give access to a digitalised Quran [1]. These digitalised Qurans usually contain translations and reciters' voices, and some provide a search function with which users can obtain the verses that contain the entered keyword.

The simple search or query function found in these digitalised Qurans is an instance of an Information Retrieval (IR) problem in the domain of Quranic text. However, a more complex query, such as a sentence-based or phrase-based query, requires more than keyword matching to retrieve the verses most relevant to the user. That is why Quranic text IR is demanding and far from trivial: there will always be cases where users do not use the right words to represent the knowledge they are seeking.

IR researchers working on Quranic IR address this problem by expanding or reformulating the query. Existing work expands or reformulates queries using hierarchical knowledge such as ontologies and thesauri, or using Natural Language Processing (NLP) techniques such as stemming. A main interest of researchers in Quranic IR has always been to find and explore new or different ways of matching the user's keywords with the knowledge (verse) the user wants to find.

Word Sense Disambiguation (WSD) has been of less interest to the IR research community because its reported impact on IR performance has been insignificant or very small, or a mix of positive and negative results.
However, applying WSD to IR should, in theory, improve IR accuracy, since some believe that deeper semantic analysis of the query will have a positive impact on IR performance. That could be the reason why WSD has been continuously studied. Although there is recent work on WSD for IR, we have so far not seen any work that applies WSD to Quranic-domain IR, and perhaps none for any domain-specific IR.

Therefore, in this paper we propose a Quranic IR model in which Word Sense Disambiguation (WSD) is implemented to provide deeper semantic analysis of the user's keywords. Since the context we want to disambiguate is very domain-specific, we assume that supervised learning (which uses external resources like WordNet, LDOC etc) will not be suitable: such resources cover the general, open domain, which clearly will not have a big impact or, worse, could reduce IR performance. Our assumption is supported by a report by Hwee [2], which mentions that WSD in IR gives positive results when unsupervised WSD is used. Thus, we focus our study on minimal and unsupervised WSD approaches that treat WSD more like word sense discrimination, or word/context clustering, for the domain of English translated Quranic text.
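
To make the idea of word sense discrimination by context clustering more concrete, the following is a minimal sketch and not the algorithm of this project, which is still at the proposal stage. It assumes a bag-of-words context window around the target word, cosine similarity, and a greedy single-pass clustering with a hypothetical threshold of 0.2. The function names, the stopword list and the four sentences are invented for illustration only and are not taken from any Quran translation.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "of", "a", "is", "to", "for", "by", "who", "are",
             "those", "over", "and", "in"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def context_vector(text, target, window=5):
    """Count the words that occur within `window` tokens of `target`."""
    tokens = tokenize(text)
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            vec.update(tokens[max(0, i - window):i])
            vec.update(tokens[i + 1:i + 1 + window])
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def discriminate(vectors, threshold=0.2):
    """Greedy single-pass clustering: each context joins the most similar
    existing cluster if the similarity reaches `threshold`, otherwise it
    starts a new cluster. Each resulting cluster stands for one induced sense."""
    clusters = []
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [idx]})
        else:
            best["centroid"].update(vec)  # summed counts act as the centroid
            best["members"].append(idx)
    return [cluster["members"] for cluster in clusters]

# Invented toy sentences containing the ambiguous word "light".
sentences = [
    "the light of the sun shines over the earth",
    "the sun gives light to the earth by day",
    "the burden is light for those who are patient",
    "a light burden is given to the patient",
]
vectors = [context_vector(s, "light") for s in sentences]
clusters = discriminate(vectors)
print(clusters)  # [[0, 1], [2, 3]]
```

In this toy run, the two "illumination" contexts fall into one cluster and the two "not heavy" contexts into another, without any sense inventory such as WordNet. In an IR setting, the context of a query keyword could be assigned to the nearest cluster in the same way, so that only verses from that cluster are ranked; this use is likewise an assumption of the sketch.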