Khmer Word Segmentation based on Bi-Directional Maximal Matching for Plaintext and Microsoft Word Document

Narin Bi* and Nguonly Taing†
Royal University of Phnom Penh, Phnom Penh, Cambodia
narin.mite229@rupp.edu.kh* and taing.nguonly@rupp.edu.kh†
Abstract— A major component of Khmer language processing is the transformation of Khmer text into a series of separated Khmer words. Unlike Latin-script languages such as English or French, Khmer has no explicit word boundary delimiter, such as a blank space, between words. Moreover, Khmer word formation is complex: the Khmer Unicode standard permits different orderings of character components that yield the same visual representation, so two words may look identical yet have different underlying character orders. In addition, a Khmer word can itself be a compound of two or more Khmer words. These complications pose many challenges in determining word boundaries for Khmer word segmentation. In response to these challenges, and to improve the accuracy and performance of Khmer word segmentation, this paper presents a study of Bi-directional Maximal Matching (BiMM) with Khmer clusters, Khmer Unicode character order correction, corpus list optimization to reduce the frequency of dictionary lookups, and Khmer text manipulation tweaks. The study also addresses how to implement Khmer word segmentation on Khmer content in both plaintext and Microsoft Word documents; for Word documents, the implementation operates both on the currently active document and on Word document files. The study compares BiMM with Forward Maximal Matching (FMM), Backward Maximal Matching (BMM), and a similar algorithm from a previous study. The result is 98.13% accuracy with a running time of 2.581 seconds on Khmer content of 1,110,809 characters, or about 160,000 Khmer words.
Keywords: Bi-directional Maximal Matching, Forward Maximal Matching, Backward Maximal Matching, Word Segmentation, Khmer Unicode, Khmer Cluster
I. INTRODUCTION
Word segmentation is one of the most basic yet crucial tasks in the analysis of languages without word boundary delimiters. In natural language processing (NLP) applications such as information retrieval, machine translation, and speech processing, word boundaries are very important.
Broadly, word segmentation is based on two approaches: supervised and unsupervised. A supervised segmentation approach can achieve very high precision in the targeted knowledge domain with the help of a training corpus, a manually segmented text collection. An unsupervised segmentation approach, by contrast, is used where no or only limited training data is available. The resulting accuracy of an unsupervised approach may not be very satisfying, but the effort of creating a training data set is not strictly required.
Bi-directional Maximal Matching (BiMM), a supervised segmentation approach, is fundamentally the combination of the Forward Maximal Matching (FMM) and Backward Maximal Matching (BMM) algorithms. It resolves the ambiguity problems caused by the greedy nature of the Maximal Matching algorithm when either FMM or BMM is used alone.
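The combination can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dictionary is a toy example, and the tie-breaking rule (prefer the candidate with fewer words) is a common BiMM heuristic assumed here, not necessarily the paper's exact disambiguation rule.

```python
# Sketch of Bi-directional Maximal Matching (BiMM) over a word set.
# Assumptions: toy dictionary; unknown characters fall back to
# single-character tokens; BiMM prefers the segmentation with fewer words.

def fmm(text, dictionary, max_len):
    """Forward Maximal Matching: scan left to right, greedily taking
    the longest dictionary word that starts at the current position."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])  # single char is the fallback
                i = j
                break
    return words

def bmm(text, dictionary, max_len):
    """Backward Maximal Matching: scan right to left, greedily taking
    the longest dictionary word that ends at the current position."""
    words, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if text[i:j] in dictionary or i == j - 1:
                words.insert(0, text[i:j])
                j = i
                break
    return words

def bimm(text, dictionary):
    """Run both directions and pick one result; here we assume the
    common heuristic of preferring the segmentation with fewer words."""
    max_len = max(len(w) for w in dictionary)
    f = fmm(text, dictionary, max_len)
    b = bmm(text, dictionary, max_len)
    return f if len(f) <= len(b) else b
```

For example, with the dictionary {"ab", "bc", "a", "c"}, FMM segments "abc" as ["ab", "c"] while BMM yields ["a", "bc"]; when the two directions agree, the shared result is returned, which is where BiMM removes single-direction greedy errors.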
II. RELATED WORK
Many works on text segmentation have been proposed, with new techniques based on supervised and unsupervised approaches for languages such as Chinese, Japanese, Thai, and Vietnamese.
For Khmer word segmentation, the paper on Khmer Unicode Text Segmentation using Maximal Matching [2] uses the Maximal Matching algorithm combined with Khmer clusters, a Khmer Unicode sorting technique based on decimal codes, and a hash table for dictionary lookup. With these techniques, Khmer text segmentation achieves up to 97% segmentation accuracy in 0.2321 seconds, in an experiment on 10 random newspaper documents.
The paper on Word Bigram vs. Orthographic Syllable Bigram in Khmer Word Segmentation [7] discusses the segmentation of written Khmer text based on the Bigram model. The research carries out two methods extended from the Bigram model: Word Bigram and Orthographic Syllable Bigram. The sound-similarity error identification from the authors' previous research is incorporated to improve the accuracy of the segmentation, which is based on the maximum matching algorithm. Many texts of different genres were collected and manually segmented; the training corpus is 10.6 MB in size, with 673,295 words and 20,991 vocabulary entries. Random texts of about 190 KB (13,025 words) were selected from newspapers and a book to evaluate the accuracy of the approach. In the experiment, the automatically segmented text is compared with the manually segmented text. According to this experiment, Word Bigram outperforms the Orthographic Syllable Bigram approach (see Table I).
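The word-bigram idea can be illustrated as a dynamic program: among all dictionary segmentations of the input, choose the one maximizing the sum of log bigram probabilities. This is a minimal sketch under stated assumptions; the toy dictionary, the `bigram_logp` scoring function, and the `<s>` start symbol are illustrative, not the trained model of [7].

```python
# Sketch of word-bigram segmentation via dynamic programming.
# Assumptions: `bigram_logp(prev, w)` returns a log probability
# estimated from a segmented corpus; "<s>" marks the sentence start.

import math
from functools import lru_cache

def bigram_segment(text, dictionary, bigram_logp, max_len=10):
    """Return the dictionary segmentation of `text` with the highest
    total log bigram score, or None if no segmentation exists."""
    @lru_cache(maxsize=None)
    def best(i, prev):
        # Best-scoring segmentation of text[i:] given previous word.
        if i == len(text):
            return 0.0, []
        best_score, best_words = -math.inf, None
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            w = text[i:j]
            if w in dictionary:
                score, rest = best(j, w)
                score += bigram_logp(prev, w)
                if score > best_score:
                    best_score, best_words = score, [w] + rest
        return best_score, best_words
    return best(0, "<s>")[1]
```

With a scorer that favors the pair ("a", "bc") over ("ab", "c"), `bigram_segment("abc", ...)` picks ["a", "bc"] even though forward maximal matching would greedily take "ab" first, which is the kind of error the bigram model corrects.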
978-616-361-823-8 © 2014 APSIPA APSIPA 2014