State-of-the-Art Vietnamese Word Segmentation

Song Nguyen Duc Cong, Computer Science Department, Assumption University, Bangkok, Thailand, st5429822@au.edu
Quoc Hung Ngo, University of Information Technology, Ho Chi Minh City, Vietnam, hungnq@uit.edu.vn
Rachsuda Jiamthapthaksin, Computer Science Department, Assumption University, Bangkok, Thailand, rachsuda@scitech.au.edu

Abstract—Word segmentation is the first step of any task in Vietnamese language processing. This paper reviews state-of-the-art approaches and systems for word segmentation in Vietnamese. To give an overview of all stages from building corpora to developing toolkits, we discuss the corpus-building stage, the approaches applied to word segmentation, and the existing toolkits for segmenting words in Vietnamese sentences. In addition, this study clearly presents the motivations for building corpora and applying machine learning techniques to improve the accuracy of Vietnamese word segmentation. Based on our observations, this study also reports the achievements and limitations of existing Vietnamese word segmentation systems.

Index Terms—Vietnamese word segmentation; text modelling; Vietnamese corpus

I. INTRODUCTION

Lexical analysis, syntactic analysis, semantic analysis, discourse analysis, and pragmatic analysis are the five main steps in natural language processing [1], [2]. While morphology is a basic task in the lexical analysis of English, word segmentation is the corresponding basic task in the lexical analysis of Vietnamese and other East Asian languages. This task determines the boundaries between words in a sentence; in other words, it segments a list of tokens into a list of meaningful words. Word segmentation is the primary step prior to other natural language processing tasks, e.g., term extraction and linguistic analysis (as shown in Figure 1). It identifies the basic meaningful units in input texts, which are then processed in the subsequent steps of several applications.
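One classical dictionary-based baseline for the segmentation task described above is longest matching (maximum matching). The following is a minimal sketch with a hypothetical toy lexicon; it illustrates the idea only and is not a system reviewed in this survey:

```python
# Minimal sketch of dictionary-based longest-matching word segmentation.
# The lexicon entries and the example sentence are illustrative only.

def longest_match_segment(tokens, lexicon, max_len=4):
    """Greedily merge syllables into the longest word found in the lexicon."""
    words = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, shrinking until a lexicon match
        # is found; fall back to a single syllable if nothing matches.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words

# Toy lexicon of multi-syllable Vietnamese words (hypothetical entries).
lexicon = {"học sinh", "giỏi"}
print(longest_match_segment("học sinh giỏi".split(), lexicon))
# -> ['học sinh', 'giỏi']
```

Greedy longest matching is fast but cannot resolve overlap ambiguities, which is one motivation for the statistical approaches surveyed below.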
For named entity recognition [3], word segmentation chunks the sentences of input documents into sequences of words before they are further classified into named entity classes. For Vietnamese, words and candidate terms can be extracted from Vietnamese corpora (such as books, novels, news, and so on) by using a word segmentation tool. The features and context of these words and terms are then used to identify named entity tags, document topics, or function words. For linguistic analysis, several linguistic features from dictionaries can be used either to annotate POS tags or to identify answer sentences. Moreover, language models can be trained using machine learning approaches and used in tagging systems, like the named entity recognition system of Tran et al. [3].

Many studies focus on word segmentation for Asian languages, such as Chinese, Japanese, Burmese (Myanmar), and Thai [4], [5], [6], [7]. Approaches to the word segmentation task vary from lexicon-based to machine learning-based methods. Recently, machine learning-based methods such as Support Vector Machines and Conditional Random Fields have been widely used to solve this problem [8], [9]. In general, Chinese is the language with the most studies on the word segmentation problem. However, there is a lack of surveys of word segmentation studies on Asian languages, including Vietnamese. This paper aims to review state-of-the-art word segmentation approaches and systems applied to Vietnamese. This study will also serve as a foundation for studies on Vietnamese word segmentation and subsequent Vietnamese processing tasks, such as part-of-speech tagger, chunker, or parser systems.

Figure 1. Word segmentation tasks (blue) in the Vietnamese natural language processing system.

There have been several studies on the Vietnamese word segmentation task over the last decade. Dinh et al. started this task with a Weighted Finite State Transducer (WFST) approach and a Neural Network approach [10].
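The machine learning methods mentioned above (SVM, CRF) typically cast word segmentation as per-syllable sequence labeling: each syllable receives a B (begin-word) or I (inside-word) tag, so segmentation reduces to token classification. The sketch below shows this common BI encoding with illustrative example words; a real system would train a classifier over token features rather than use gold tags:

```python
# Minimal sketch of the sequence-labeling view of word segmentation:
# B marks the first syllable of a word, I marks a continuation syllable.

def words_to_tags(words):
    """Convert segmented words (syllables joined by spaces) to BI tags."""
    tags = []
    for word in words:
        syllables = word.split()
        tags += ["B"] + ["I"] * (len(syllables) - 1)
    return tags

def tags_to_words(syllables, tags):
    """Recover words from syllables and their predicted BI tags."""
    words = []
    for syl, tag in zip(syllables, tags):
        if tag == "B" or not words:
            words.append(syl)
        else:
            words[-1] += " " + syl
    return words

tags = words_to_tags(["học sinh", "giỏi"])           # -> ['B', 'I', 'B']
print(tags_to_words("học sinh giỏi".split(), tags))  # -> ['học sinh', 'giỏi']
```

Under this encoding, an SVM or CRF predicts one tag per syllable from contextual features, and the tag sequence is decoded back into words as shown.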
In addition, machine learning approaches have been studied and widely applied to natural language processing, including word segmentation. In fact, several studies used support vector machines (SVM) and conditional random fields (CRF) for the word segmentation task [8], [9]. Based on annotated corpora and token-based features, these studies built word segmentation systems with accuracies of about 94%-97%. From our observations, we found that there is a lack of a complete review of the approaches, datasets, and toolkits recently used in Vietnamese word segmentation.

arXiv:1906.07662v1 [cs.CL] 18 Jun 2019

An all-sided