Khmer Word Segmentation based on Bi-Directional Maximal Matching for Plaintext and Microsoft Word Document

Narin Bi* and Nguonly Taing†
Royal University of Phnom Penh, Phnom Penh, Cambodia
narin.mite229@rupp.edu.kh* and taing.nguonly@rupp.edu.kh†
Abstract— A major component of Khmer language processing is the transformation of Khmer text into a series of separated Khmer words. Unlike Latin-script languages such as English or French, Khmer has no explicit word boundary delimiter, such as a blank space, between words. Moreover, Khmer word formation is complex: the Khmer Unicode standard permits different orderings of character components that yield the same visual representation, so two words may look identical yet have different underlying character orders. In addition, a Khmer word can itself be a compound of two or more Khmer words. These complications pose many challenges in determining word boundaries for Khmer word segmentation. In response to these challenges, and to improve the accuracy and performance of Khmer word segmentation, this paper presents a study of Bi-directional Maximal Matching (BiMM) with Khmer clusters, Khmer Unicode character order correction, corpus list optimization to reduce the frequency of dictionary lookups, and Khmer text manipulation tweaks. The study also addresses how to implement Khmer word segmentation on Khmer content in both plaintext and Microsoft Word documents; for Word documents, the implementation operates both on the currently active document and on Word document files. The study compares BiMM with Forward Maximal Matching (FMM), Backward Maximal Matching (BMM), and a similar algorithm from a previous study. The result is 98.13% accuracy with a running time of 2.581 seconds on Khmer content of 1,110,809 characters, or about 160,000 Khmer words.
Keywords: Bi-directional Maximal Matching, Forward Maximal Matching, Backward Maximal Matching, Word Segmentation, Khmer Unicode, Khmer Cluster
I. INTRODUCTION
Word segmentation is one of the most basic yet crucial tasks in the analysis of languages without word boundary delimiters. In natural language processing (NLP) applications such as information retrieval, machine translation, and speech processing, word boundaries are very important.
Broadly, word segmentation is based on two approaches: supervised and unsupervised. A supervised segmentation approach can achieve very high precision in the targeted knowledge domain with the help of a training corpus, a manually segmented text collection. An unsupervised segmentation approach, by contrast, is used where no or only limited training data is available. The resulting accuracy of an unsupervised approach may not be very satisfying, but the effort of creating a training data set is not strictly required.
Bi-directional Maximal Matching (BiMM), a supervised segmentation approach, is fundamentally the combination of the Forward Maximal Matching (FMM) and Backward Maximal Matching (BMM) algorithms. It resolves the ambiguity problems caused by the greedy nature of the Maximal Matching algorithm when either FMM or BMM is used alone.
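The combination can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dictionary is a toy example, and the tie-breaking rule (prefer the candidate with fewer words) is a common BiMM heuristic assumed here, not necessarily the paper's exact disambiguation rule.

```python
# Sketch of Bi-directional Maximal Matching (BiMM) over a word set.
# Assumptions: toy dictionary; unknown characters fall back to
# single-character tokens; BiMM prefers the segmentation with fewer words.

def fmm(text, dictionary, max_len):
    """Forward Maximal Matching: scan left to right, greedily taking
    the longest dictionary word that starts at the current position."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])  # single char is the fallback
                i = j
                break
    return words

def bmm(text, dictionary, max_len):
    """Backward Maximal Matching: scan right to left, greedily taking
    the longest dictionary word that ends at the current position."""
    words, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if text[i:j] in dictionary or i == j - 1:
                words.insert(0, text[i:j])
                j = i
                break
    return words

def bimm(text, dictionary):
    """Run both directions and pick one result; here we assume the
    common heuristic of preferring the segmentation with fewer words."""
    max_len = max(len(w) for w in dictionary)
    f = fmm(text, dictionary, max_len)
    b = bmm(text, dictionary, max_len)
    return f if len(f) <= len(b) else b
```

For example, with the dictionary {"ab", "bc", "a", "c"}, FMM segments "abc" as ["ab", "c"] while BMM yields ["a", "bc"]; when the two directions agree, the shared result is returned, which is where BiMM removes single-direction greedy errors.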
II. RELATED WORK
Many works on text segmentation have been proposed, with new techniques based on supervised and unsupervised approaches for languages such as Chinese, Japanese, Thai, and Vietnamese.
For Khmer word segmentation, the paper on Khmer Unicode Text Segmentation using Maximal Matching [2] uses the Maximal Matching algorithm combined with Khmer clusters, a Khmer Unicode sorting technique based on decimal codes, and a hash table for dictionary lookup. With these techniques, Khmer text segmentation achieves up to 97% segmentation accuracy in 0.2321 seconds, in an experiment on 10 random newspaper documents.
The paper on Word Bigram vs. Orthographic Syllable Bigram in Khmer Word Segmentation [7] discusses the segmentation of written Khmer text based on the Bigram model. The research carries out two methods extended from the Bigram model: Word Bigram and Orthographic Syllable Bigram. The sound-similarity error identification from the authors' previous research is incorporated to improve the accuracy of the segmentation, which is based on the maximum matching algorithm. Many texts of different genres were collected and manually segmented; the training corpus is 10.6 MB in size, with 673,295 words and 20,991 vocabulary entries. Random texts of about 190 KB (13,025 words) were selected from newspapers and a book to evaluate the accuracy of the approach. In the experiment, the automatically segmented text is compared with the manually segmented text. According to this experiment, Word Bigram outperforms the Orthographic Syllable Bigram approach (see Table I).
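The word-bigram idea can be illustrated as a dynamic program: among all dictionary segmentations of the input, choose the one maximizing the sum of log bigram probabilities. This is a minimal sketch under stated assumptions; the toy dictionary, the `bigram_logp` scoring function, and the `<s>` start symbol are illustrative, not the trained model of [7].

```python
# Sketch of word-bigram segmentation via dynamic programming.
# Assumptions: `bigram_logp(prev, w)` returns a log probability
# estimated from a segmented corpus; "<s>" marks the sentence start.

import math
from functools import lru_cache

def bigram_segment(text, dictionary, bigram_logp, max_len=10):
    """Return the dictionary segmentation of `text` with the highest
    total log bigram score, or None if no segmentation exists."""
    @lru_cache(maxsize=None)
    def best(i, prev):
        # Best-scoring segmentation of text[i:] given previous word.
        if i == len(text):
            return 0.0, []
        best_score, best_words = -math.inf, None
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            w = text[i:j]
            if w in dictionary:
                score, rest = best(j, w)
                score += bigram_logp(prev, w)
                if score > best_score:
                    best_score, best_words = score, [w] + rest
        return best_score, best_words
    return best(0, "<s>")[1]
```

With a scorer that favors the pair ("a", "bc") over ("ab", "c"), `bigram_segment("abc", ...)` picks ["a", "bc"] even though forward maximal matching would greedily take "ab" first, which is the kind of error the bigram model corrects.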
978-616-361-823-8 © 2014 APSIPA APSIPA 2014