A Pointwise Approach for Vietnamese Diacritics Restoration
Tuan Anh Luu
Kazuhide Yamamoto
Department of Electrical Engineering, Nagaoka University of Technology
{anh,yamamoto}@jnlp.org
Abstract—The automatic insertion of diacritics in
electronic texts is necessary for a number of languages,
including French, Romanian, Croatian, Sindhi, and
Vietnamese. When diacritics are removed from a
word and the resulting string of characters is not a
word, it is easy to recover the diacritics. However,
sometimes the resulting string is also a word, possibly
with different grammatical properties or a different
meaning, and this makes recovery of the missing
diacritics a difficult task for software as well as for
human readers. This paper is the first to study
automatic diacritic restoration in Vietnamese texts.
Modern Vietnamese is a complex language with many
diacritical marks, and white space does not always
function as a word separator. This paper proposes a
pointwise approach for automatically recovering
missing diacritics, using three features for
classification: n-grams of syllables, n-grams of syllable
types, and dictionary word features. Our experiments
show that the proposed method can recover diacritics
with a 94.7% accuracy rate.
Keywords- Vietnamese, automatic diacritic restoration,
pointwise approach, natural language processing,
classification.
I. INTRODUCTION
Spell checking, which involves detecting and correcting
spelling errors, is one of the most common natural
language processing applications. The most frequent
errors are orthographic and typing errors. However, there
is an additional category of spell checking that is needed
for most European languages (although not for English)
and for some African and Asian languages: the restoration
of diacritics. Automated restoration of diacritics is useful
for reconstructing legacy texts that were typeset without
diacritics. In addition, it is needed for a growing number
of contemporary texts that lack diacritics, mainly because
there is no accepted standard for encoding diacritics and
users therefore find it easier to omit them when they type.
This practice is especially common in casual forms of
electronic communication such as e-mails, posts on
discussion forums, and chats. Thus, missing diacritics
pose a serious problem not only for automatic text
processing and information retrieval, but also for human
readers.
There are two basic approaches to diacritic restoration:
word-based and character-based [8]. Word-based
approaches are usually implemented as knowledge-
intensive systems that rely on dictionaries and statistical
language models and are therefore language dependent.
These approaches require large corpora of grammatically
correct text in order to build useful models, and they
require considerable preprocessing time for tokenization,
tagging, and other tasks. In contrast, character-based
systems use language-independent algorithms based on
statistical information that has been learned from training
data. For languages in which diacritics signal grammatical
or semantic roles, word-based systems are much more
reliable than character-based systems [3]. In general, the
choice between the two approaches for restoring diacritics
will depend on several factors: the role of diacritics in the
targeted language, the availability of adequate training
data, the processing speed that is required, and the needs
of users.
Like many European languages, modern Vietnamese uses the
Latin alphabet. In addition to the characters used in
English, however, Vietnamese has letters that are modified
with diacritics (đ, ă, â, ê, ô, ơ, and ư), and an Input
Method Editor (IME) is needed to enter these special
characters in electronic texts. Because IMEs can be slow
and difficult to install and use, many Vietnamese writers
choose non-diacritical Vietnamese, which can be entered on
any computer and is easier and quicker to type.
Non-diacritical Vietnamese, however, is difficult to
understand and can be very confusing.
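The source of this confusion can be illustrated with a short sketch. It is not part of the proposed method; it only simulates how non-diacritical text arises, assuming Unicode input and using Python's standard `unicodedata` module. The helper name `strip_diacritics` and the sample syllables are chosen here for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with Vietnamese diacritics removed (illustrative helper)."""
    # The letters đ/Đ have no canonical decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn), keeping the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Six syllables that differ only in their diacritics.
candidates = ["ma", "má", "mà", "mả", "mã", "mạ"]
stripped = {strip_diacritics(s) for s in candidates}
print(stripped)
```

All six syllables collapse to the same string once the marks are removed, so a restoration system has to choose among them using the surrounding context.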
Word-based approaches to Vietnamese language
processing face two major challenges: there are not enough
textual data (such as dictionaries and corpora), and the
Vietnamese language does not have a word separator (this
is a problem because word-based approaches must first
segment the text into words). Phuong (2007) [10]
reported a 97% accuracy rate for word segmentation of
diacritical Vietnamese. However, the accuracy rate will be
considerably lower for non-diacritical Vietnamese, where
word segmentation is much more complex and difficult.
The abundance of diacritics, along with the absence of a
word separator, also makes Vietnamese a difficult language
for traditional character-based restoration, which can be
expected to yield a high degree of accuracy only for
languages whose diacritics can be restored without
examining the context [7]. A new and powerful approach
to restoring diacritics is therefore needed for the
Vietnamese language.
This is the first research to address the problem of
restoring diacritical marks in non-diacritical Vietnamese
texts. We propose a pointwise approach that automatically
restores missing diacritics using three types of features for
classification: n-grams of syllables, n-grams of syllable
types, and dictionary word features. The pointwise
approach is simple, powerful, and relatively robust to rare
cases that may occur in the text, with little reduction in
accuracy [1].
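The three feature types named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the window size, the n-gram order, the coarse syllable-type classes (L, U, N, O), the feature-string format, and all function names are hypothetical. "Pointwise" is taken here to mean that each syllable is classified independently, using only the non-diacritical input around it and none of the labels predicted for its neighbours. The classifier itself is omitted.

```python
from typing import List, Sequence, Set


def syllable_type(syllable: str) -> str:
    """Map a syllable to a coarse class (assumed inventory, for illustration)."""
    if syllable.isdigit():
        return "N"  # number
    if syllable.isalpha():
        return "U" if syllable[0].isupper() else "L"  # capitalized / lowercase
    return "O"  # punctuation and anything else


def ngrams_around(seq: Sequence[str], i: int, window: int,
                  max_n: int, prefix: str) -> List[str]:
    """All n-grams (n <= max_n) lying inside the window centred on position i."""
    padded = ["<s>"] * window + list(seq) + ["</s>"] * window
    center = i + window
    feats = []
    for n in range(1, max_n + 1):
        for start in range(center - window, center + window - n + 2):
            gram = "_".join(padded[start:start + n])
            feats.append(f"{prefix}{n}[{start - center}]={gram}")
    return feats


def dictionary_features(seq: Sequence[str], i: int, lexicon: Set[str],
                        max_len: int = 3) -> List[str]:
    """Flag dictionary words (stored without diacritics) that cover position i."""
    feats = []
    for k in range(1, max_len + 1):
        for start in range(max(0, i - k + 1), min(i, len(seq) - k) + 1):
            if " ".join(seq[start:start + k]).lower() in lexicon:
                # Record the word length and where position i falls inside it.
                feats.append(f"dict{k}[{i - start}]")
    return feats


def extract_features(seq: Sequence[str], i: int, lexicon: Set[str],
                     window: int = 1, max_n: int = 2) -> List[str]:
    """Features for classifying the syllable at position i on its own."""
    types = [syllable_type(s) for s in seq]
    return (ngrams_around(seq, i, window, max_n, "syl")
            + ngrams_around(types, i, window, max_n, "typ")
            + dictionary_features(seq, i, lexicon))


# Toy input: a non-diacritical sentence and a two-entry lexicon (both made up).
sentence = ["toi", "di", "hoc", "o", "ha", "noi"]
lexicon = {"ha noi", "di hoc"}
features = extract_features(sentence, 4, lexicon)
print(features)
```

In such a setup, the feature list for each position would be passed to a multiclass classifier whose labels are the candidate diacritical forms of that syllable. Because no predicted label feeds into another prediction, an error at one position does not propagate to its neighbours.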
The rest of this paper is organized as follows. Section
II presents a brief overview of Vietnamese orthography
and some statistical data. Section III describes our
approach to restoring diacritics, and Section IV shows the
2012 International Conference on Asian Language Processing
978-0-7695-4886-9/12 $26.00 © 2012 IEEE
DOI 10.1109/IALP.2012.18