The Effectiveness of a Jawi Stemmer for Retrieving Relevant Malay Documents in Jawi Characters

SULIANA SULAIMAN, Sultan Idris Education University, Malaysia
KHAIRUDDIN OMAR, NAZLIA OMAR, MOHD ZAMRI MURAH, and HAMDAN ABDUL RAHMAN, Universiti Kebangsaan Malaysia

The Malay language has two writing scripts, Rumi and Jawi. Most previous stemmer results have been reported for Malay in Rumi characters, and only a few have been tested on Jawi characters. In this article, a new Jawi stemmer is proposed and tested for document retrieval. A total of 36 queries and datasets from the transliterated Jawi Quran were used. The experiment shows that the mean average precision for “stemmed Jawi” documents is 8.43%, whereas that for “nonstemmed Jawi” documents is 5.14%. A paired-sample t-test showed that the use of “stemmed Jawi” documents increased the precision of document retrieval. Further experiments examined the precision of the relevant documents retrieved at various cutoff points for all 36 queries. The results for the “stemmed Jawi” documents differed significantly from those for the “nonstemmed Jawi” documents starting at a cutoff of 40. This result shows the usefulness of a Jawi stemmer for retrieving relevant documents in the Jawi script.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing - Language models; Language parsing and understanding; Text analysis; H.3.4 [Information Storage and Retrieval]: Systems and Software - Performance evaluation (efficiency and effectiveness); H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Linguistic

General Terms: Languages, Performance

Additional Key Words and Phrases: Jawi stemmer, Malay stemmer, Jawi document retrieval, stemming

ACM Reference Format:
Sulaiman, S., Omar, K., Omar, N., Murah, M. Z., and Rahman, H. A. 2014. The effectiveness of a Jawi stemmer for retrieving relevant Malay documents in Jawi characters. ACM Trans. Asian Lang. Inform. Process. 13, 2, Article 6 (June 2014), 21 pages.
DOI:http://dx.doi.org/10.1145/2540988

Authors’ addresses: S. Suliana (corresponding author), Faculty of Art, Computing and Creative Industry, Sultan Idris Education University, Tanjung Malim, Perak Darul Ridzuan 35900, Malaysia; email: ssuliana@yahoo.com; K. Omar, N. Omar, M. Z. Murah, and H. A. Rahman, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2014 ACM 1530-0226/2014/06-ART6 $15.00

ACM Transactions on Asian Language Information Processing, Vol. 13, No. 2, Article 6, Publication date: June 2014.

1. INTRODUCTION

Stemming in Malay is more complex than in English. The Malay language has two different types of script: the Jawi script and the Rumi script. Jawi is an Arabic-script-based orthography, and Rumi is a Roman-based script. Jawi is read from right to left, and its characters take different forms. For example, the word “king” in Malay can be written as “راج” in Jawi or “Raja” in Rumi. The Jawi script was used as early as 674 [Nasruddin et al. 2008]. It is also used as a writing system in the Malay Archipelago. Jawi has also been used as an art form in Islamic calligraphy. This type of calligraphy can be seen in architecture, where walls are decorated using the Jawi