Abstract—A part-of-speech tagger assigns the correct grammatical category to each word in a given text based on the context surrounding the word. This paper presents Mi-POS, a Malay language part-of-speech tagger developed using a probabilistic approach that takes context information into account. The results of benchmarking Mi-POS against several similar systems are also presented, and the lessons learnt are highlighted. The dataset used for evaluation consists of manually annotated texts. Accuracy and time are used to measure the results of this evaluation. The final results show that Mi-POS outperforms other Malay part-of-speech taggers in terms of accuracy, achieving 95.16% when tagging new words from the same corpus type as the training corpus and 81.12% for words from different corpus types.

Index Terms—Benchmarking, Malay language, natural language processing, part-of-speech tagging.

I. INTRODUCTION

Part-of-speech (POS) tagging is an important process used to build many Natural Language Processing (NLP) applications. The POS tagger assigns a unique grammatical class to each word in a context (e.g., a sentence). However, natural language words can have different POS tags depending on their contexts. This ambiguity makes POS tagging a non-trivial process, since interpreting the context is essential to finding the correct tag for a given word. To automate this process, machine learning techniques, including statistical and probabilistic methods, have been used to build powerful POS taggers. Training the machine learning models requires a manually built POS-tagged corpus so that the models can predict the correct tags for new words. Such a corpus may be available for the major languages. However, due to the lack of linguistic resources for the Malay language, this corpus needs to be constructed manually before it can be used to train the POS models.
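To make the role of context concrete, the following is a minimal sketch of a probabilistic tagger trained on a hand-tagged corpus. It is not the Mi-POS model: it assumes a plain bigram hidden Markov model with add-one smoothing and Viterbi decoding, and the toy English corpus, the tag names, and the function names (train, viterbi, accuracy) are hypothetical choices made only for illustration. The word "book" appears in the toy corpus both as a noun and as a verb, so only the surrounding tags can separate the two readings.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # pseudo-tag marking the beginning of a sentence

# Hypothetical hand-tagged toy corpus; "book" is ambiguous (NOUN or VERB).
TOY_CORPUS = [
    [("I", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN")],
    [("I", "PRON"), ("read", "VERB"), ("a", "DET"), ("book", "NOUN")],
    [("they", "PRON"), ("book", "VERB"), ("the", "DET"), ("room", "NOUN")],
    [("the", "DET"), ("book", "NOUN"), ("is", "VERB"), ("good", "ADJ")],
]


def train(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            vocab.add(word.lower())
            prev = tag
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log-probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words

    # best[i][tag] = (log-score of best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(trans[START], tag, n_tags)
                 + log_prob(emit[tag], words[0].lower(), n_words))
        best[0][tag] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emission = log_prob(emit[tag], words[i].lower(), n_words)
            best[i][tag] = max(
                (best[i - 1][prev][0]
                 + log_prob(trans[prev], tag, n_tags)
                 + emission, prev)
                for prev in tags
            )

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


model = train(TOY_CORPUS)
for text in ("they read the book", "I book the flight"):
    words = text.split()
    print(list(zip(words, viterbi(words, *model))))
```

On this toy data, the transition counts are what let the tagger prefer a noun after a determiner and a verb after a pronoun. A realistic tagger would need a much larger annotated corpus and a better treatment of unseen words than add-one smoothing.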
In this paper, a Malay POS tagger called Mi-POS is developed and compared with other existing Malay POS taggers. A manually built corpus is constructed to train the models. Another two manually built corpora are used to test the models. To this end, this paper is structured as follows: Section II describes the related work on existing POS taggers; Section III highlights the proposed model of our Mi-POS; Section IV shows all the results of the experiment; Section V discusses the results and performance of Mi-POS compared to other systems. Finally, Section VI concludes this paper with a discussion on the overall outcome achieved and future research directions.

Manuscript received December 12, 2015; revised February 29, 2016.
Dickson Lukose, Khalil Bouzekri and Benjamin Chu Min Xian are with the Artificial Intelligence Lab at MIMOS Berhad, Kuala Lumpur, 57000 Malaysia (e-mail: dickson.lukose@mimos.my, khalil.ben@mimos.my, mx.chu@mimos.my).
Mohamed Lubani and Liew Kwei Ping are with the University of Malaya, Faculty of Computer Science and Information Technology, Kuala Lumpur, 50603 Malaysia (e-mail: mohamed.lubani@siswa.um.edu.my, liewkweiping@siswa.um.edu.my).
Rohana Mahmud is with the Department of Artificial Intelligence, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, 50603 Malaysia (e-mail: rohanamahmud@um.edu.my).

II. RELATED WORK

POS tagging is widely adopted for languages such as English, German, Spanish and Arabic [1]-[4]. It plays a significant role in text analysis, as it is an initial step in identifying the grammatical information in a text. Among the existing POS taggers are TnT Tagger [2] and Brill Tagger [5]. All of them adopt machine learning methods and achieve accuracies of 96.7%, 97.24% and 95% respectively. The rich availability of linguistic resources is the main factor contributing to the development of these taggers for the European languages.
In contrast, there is less research on POS tagging for the Malay language due to its limited resources. One Malay tagger was developed by Mohamed [6], who applies a trigram Hidden Markov Model (HMM) to identify the tags of words in Malay sentences. Context information other than the surrounding tags, namely the prefix and the suffix of each word, is used to predict the correct POS tags. His study measures the effect of using these features individually, as well as using a combination of both the prefix and the suffix of each word, on the final model's predictions. The model is tested using a corpus of 18,135 tokens tagged with a set of 21 tags similar to the set of tags used by Dewan Bahasa dan Pustaka (DBP) [7]. This corpus is tagged automatically by mapping each word to a list of possible tags from a dictionary, after which the remaining ambiguity is resolved manually. The results show that the best predictions, with an accuracy of 67.9%, are obtained using prefix information only, with a fixed prefix length of three letters. A similar accuracy of 66.7% is achieved using a combination of the first and the last three letters of each word. When using suffix information only, the best accuracy achieved is 60%, with a suffix length of five letters. These findings show that HMMs are suitable models for predicting the POS tag of any Malay word.

On the other hand, Rayner Alfred et al. proposed a rule-based method for identifying Malay POS tags called RPOS [8]. It applies affixing and word relation rules to determine the right word category. Malay words can be formed with prefixes, suffixes, circumfixes and/or infixes. In

Benchmarking Mi-POS: Malay Part-of-Speech Tagger
Benjamin Chu Min Xian, Mohamed Lubani, Liew Kwei Ping, Khalil Bouzekri, Rohana Mahmud, and Dickson Lukose
International Journal of Knowledge Engineering, Vol. 2, No. 3, September 2016, p. 115
doi: 10.18178/ijke.2016.2.3.064