A Pointwise Approach for Vietnamese Diacritics Restoration

Tuan Anh Luu, Kazuhide Yamamoto
Department of Electrical Engineering, Nagaoka University of Technology
{anh,yamamoto}@jnlp.org

Abstract—The automatic insertion of diacritics in electronic texts is necessary for a number of languages, including French, Romanian, Croatian, Sindhi, and Vietnamese. When diacritics are removed from a word and the resulting string of characters is not a word, it is easy to recover the diacritics. However, sometimes the resulting string is also a word, possibly with different grammatical properties or a different meaning, and this makes recovery of the missing diacritics a difficult task for software as well as for human readers. This paper is the first to study automatic diacritic restoration in Vietnamese texts. Modern Vietnamese is a complex language with many diacritical marks, and white space does not always function as a word separator. This paper proposes a pointwise approach for automatically recovering missing diacritics, using three features for classification: n-grams of syllables, n-grams of syllable types, and dictionary word features. Our experiments show that the proposed method can recover diacritics with an accuracy of 94.7%.

Keywords- Vietnamese, automatic diacritic restoration, pointwise approach, natural language processing, classification.

I. INTRODUCTION

Spell checking, which involves detecting and correcting spelling errors, is one of the most common natural language processing applications. The most frequent errors are orthographic and typing errors. However, there is an additional category of spell checking that is needed for most European languages (although not for English) and for some African and Asian languages: the restoration of diacritics. Automated restoration of diacritics is useful for reconstructing legacy texts that were typeset without diacritics.
In addition, it is needed for a growing number of contemporary texts that lack diacritics, mainly because there is no accepted standard for encoding diacritics and users therefore find it easier to omit them when they type. This practice is especially common in casual forms of electronic communication such as e-mails, posts on discussion forums, and chats. Thus, missing diacritics pose a serious problem not only for automatic text processing and information retrieval, but also for human readers.

There are two basic approaches to diacritic restoration: word-based and character-based [8]. Word-based approaches are usually implemented as knowledge-intensive systems that rely on dictionaries and statistical language models and are therefore language dependent. These approaches require large corpora of grammatically correct text in order to build useful models, and they require considerable preprocessing time for tokenization, tagging, and other tasks. In contrast, character-based systems use language-independent algorithms based on statistical information that has been learned from training data. For languages in which diacritics signal grammatical or semantic roles, word-based systems are much more reliable than character-based systems [3]. In general, the choice between the two approaches for restoring diacritics will depend on several factors: the role of diacritics in the targeted language, the availability of adequate training data, the processing speed that is required, and user requests and needs.

Like European languages, modern Vietnamese uses the Latin alphabet. In addition to the characters used in English, Vietnamese has letters that are modified with diacritics (đ, ă, â, ê, ô, ơ, and ư), and an Input Method Editor (IME) is needed to enter these special characters in electronic texts. However, IMEs are slow and difficult to install and use.
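To make the ambiguity of non-diacritical text concrete, the following is a minimal Python sketch and is not part of the original paper. The helper name `strip_diacritics` and the small syllable list are illustrative assumptions. The sketch removes diacritics using Unicode decomposition and then groups syllables by their non-diacritical spelling.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text: str) -> str:
    """Return `text` with diacritics removed (hypothetical helper, for illustration)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have Unicode category "Mn"; dropping them keeps base letters.
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # "đ" and "Đ" have no canonical decomposition, so they are mapped explicitly.
    return base.replace("đ", "d").replace("Đ", "D")


# Hypothetical mini-lexicon of syllables, chosen only for illustration.
syllables = ["ma", "má", "mà", "mả", "mã", "mạ", "đường"]

# Group the diacritical forms by their non-diacritical spelling.
candidates = defaultdict(list)
for syllable in syllables:
    candidates[strip_diacritics(syllable)].append(syllable)

for plain, forms in candidates.items():
    print(plain, "<-", forms)
```

In this toy lexicon, six distinct syllables collapse to the single string "ma", so a restoration system has to choose among the candidates using the surrounding context.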
Because IMEs are cumbersome, many Vietnamese users choose to write non-diacritical Vietnamese, which can be entered on any computer and is easier and quicker to type. However, non-diacritical Vietnamese is difficult to understand and can be very confusing.

Word-based approaches to Vietnamese language processing face two major challenges: there are not enough textual resources (such as dictionaries and corpora), and written Vietnamese does not have a word separator, which is a problem because word-based approaches must perform word segmentation as a preprocessing step. Phuong (2007) [10] reported a 97% accuracy rate for word segmentation of diacritical Vietnamese. However, the accuracy rate can be expected to be considerably lower for non-diacritical Vietnamese, where word segmentation is much more complex and difficult. The abundance of diacritics along with the absence of a word separator also makes Vietnamese a difficult language for traditional character-based restoration, which can be expected to yield a high degree of accuracy only for languages whose diacritics can be restored without examining the context [7]. A new and powerful approach to restoring diacritics is therefore needed for the Vietnamese language.

This is the first research to address the problem of restoring diacritical marks in non-diacritical Vietnamese texts. We propose a pointwise approach that automatically restores missing diacritics using three types of features for classification: n-grams of syllables, n-grams of syllable types, and dictionary word features. The pointwise approach is simple, powerful, and relatively robust to rare cases that may occur in the text, with little reduction in accuracy [1].

The rest of this paper is organized as follows. Section II presents a brief overview of Vietnamese orthography and some statistical data.
2012 International Conference on Asian Language Processing 978-0-7695-4886-9/12 $26.00 © 2012 IEEE DOI 10.1109/IALP.2012.18 189

Section III describes our approach to restoring diacritics, and Section IV shows the