(IJARAI) International Journal of Advanced Research in Artificial Intelligence, Vol. 3, No. 9, 2014, www.ijarai.thesai.org

Dynamic Programming Method Applied in Vietnamese Word Segmentation Based on Mutual Information among Syllables

Nguyen Thi Uyen, IT Faculty - Vinh University
Tran Xuan Sang, IT Faculty - Vinh University

Abstract—Vietnamese word segmentation is an important step in Vietnamese natural language processing tasks such as text categorization, text summarization, and machine translation. The problem is complicated because spaces in Vietnamese text do not always mark word boundaries: one word can consist of one or more syllables, depending on the context. This paper proposes a method for Vietnamese word segmentation based on the mutual information among syllables, combined with dynamic programming. With this method, we achieve an accuracy rate of about 90% on a raw text corpus.

Keywords—Vietnamese word segmentation; dynamic programming; mutual information; Vietnamese syllables

I. INTRODUCTION

Word segmentation is the process of determining the boundaries between words in a sentence. Words in Vietnamese are not always separated by blank spaces. A word may contain several syllables, and the same syllables combine into different words depending on the context of the text. The problem is therefore difficult to solve automatically.

Example 1: The Vietnamese sentence "Học sinh học sinh học" means "The pupils study biology" in English. It is composed of only two Vietnamese syllables, "học ~ study" and "sinh ~ biology", which form different words at different positions in the sentence. The correct segmentation is "học sinh | học | sinh học" ~ "The pupils | study | biology".

One of the most difficult tasks in Vietnamese word segmentation is resolving the ambiguities of a sentence. The same sentence may have different word segmentation solutions in different contexts.
Example 2: The Vietnamese sentence "Ông già đi nhanh quá" has two possible meanings: "The old man goes too fast" and "Grandfather gets old too fast". These correspond to two word segmentation solutions: "Ông già | đi | nhanh quá" and "Ông | già đi | nhanh quá". In such a case, the context of the sentence must be considered in order to select the best solution.

Mutual information (MI) between syllables measures how strongly the syllables are correlated and, therefore, how likely they are to combine into a word. A greater MI value indicates a higher probability that the syllables form a word. MI is presented in more detail in Section 3.a. The dynamic programming technique is used to reduce the computational complexity; it is presented in Section 3.b.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed method. Section 4 provides the experimental results. Finally, Section 5 summarizes the work of this paper.

II. RELATED WORKS

This section presents previous work on Vietnamese word segmentation.

A. Maximum Matching Method

The maximum matching algorithm is commonly used for the word segmentation problem. The idea is to start at the first syllable of a text and look up the longest dictionary word that begins with that syllable. If a word is found, the algorithm marks a boundary at the end of that longest word and then repeats the same longest-match search starting at the syllable following the match. Otherwise, the syllable itself is segmented as a word, and the search resumes at the next syllable (Wong et al., 1996) [1]. Dinh et al. (2001) used the maximum matching method to segment Vietnamese words [2]. However, the resulting segmentation accuracy is not high.

B. Transition Graph Method

In this method, each syllable is represented by a vertex.
Each edge carries the weight of the connection between two syllables, which is calculated from the training data. The transition graph thus shows how likely the syllables are to form words in a specific text. Nguyen et al. (2003) and Pham et al. (2009) used this method to segment Vietnamese words [3,4].

C. Support Vector Machine (SVM) Method

A point-wise machine learning method (SVM) is used to mark two kinds of symbols: a space (the word boundary symbol) and an underscore (the symbol linking two syllables of the same word) (Luu et al., 2012) [5]. Point-wise methods use three basic features: n-grams of syllables, n-grams of syllable types, and a feature dictionary. In Vietnamese, about 70% of words have 2 syllables and 14% have 3 syllables; therefore, the point-wise window was set to w = 3. The authors define four types of syllables: uppercase syllables (U), which begin with a capital letter; lowercase syllables (L), which contain only lowercase letters; numerical syllables (N), which consist only of digits; and the other type (O), for syllables that belong to a foreign language. This research achieved an accuracy rate of 98.2%. However, this method needs a good feature dictionary.
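
The four syllable types described above can be illustrated with a minimal Python sketch. It is not the implementation of Luu et al.: the function names `syllable_type` and `syllable_types` are hypothetical, and, since the text does not say how foreign-language syllables are detected, the sketch assumes that type O is simply the fallback for any syllable that matches none of the other three types.

```python
import unicodedata
from typing import List


def syllable_type(syllable: str) -> str:
    """Return 'U', 'L', 'N', or 'O' for one whitespace-delimited syllable."""
    # Normalize so that accented Vietnamese letters are single code points.
    s = unicodedata.normalize("NFC", syllable)
    if s.isdigit():
        return "N"  # numerical: consists only of digits
    if s[:1].isupper():
        return "U"  # uppercase: begins with a capital letter
    if s.isalpha() and s.islower():
        return "L"  # lowercase: contains only lowercase letters
    return "O"  # assumption: everything else is treated as "other"


def syllable_types(sentence: str) -> List[str]:
    """Map every syllable of a space-separated sentence to its type."""
    return [syllable_type(s) for s in sentence.split()]


print(syllable_types("Ông già đi nhanh quá"))
```

A sequence of such type labels, taken over a window of neighboring syllables, is what the "n-grams of syllable types" feature would be built from under these assumptions.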