Proceedings of the Language Technologies for All (LT4All), pages 137–140
Paris, UNESCO Headquarters, 5-6 December, 2019. © 2019 European Language Resources Association (ELRA), licenced under CC-BY-NC

Spoof-Vulnerable Rendering in Khmer Unicode Implementations

Joshua Horton, Ph.D., Makara Sok, Marc Durdin, Rasmey Ty
National Polytechnic Institute of Cambodia
Phnom Penh, Cambodia
joshua_horton@sil.org, makara@keyman.com, marc@keyman.com, rasmeyt2@npic.edu.kh

Abstract
While there are established conventions for typing Khmer using the Unicode Standard, existing systems provide little assistance to users in following the conventions, which are thus often ignored. When typing Khmer text, users find that words can be constructed in multiple ways, all of which look 'correct' on-screen. Furthermore, some aspects of Khmer as implemented by common operating systems deviate from the Unicode Standard. This leads to a number of negative outcomes, including phishing and spoofing security risks, poor searchability, and complications with natural language processing. This paper identifies issues in the encoding of the Khmer language with the Unicode Standard.

Keywords: Khmer script, text input, Unicode, security, character ordering

Résumé
ទបីនទោលរណ៍សប់រយអសរមែរទោយទបី “យូនីូដសតង់ោរ” ៏ទោយ ៏ជំនួយដលនដល់យអនទបីស់នុងរអនុវតមទោលរណ៍ ងទ ទៅនិចួចទៅទីយ។ ទនទេុយអនទបីស់មិនសូវប់រមែណ៍ទអីទេី។ ទេលយអថបទមែរ អនទបីស់យល់ទេច យយនទចីនរទបៀប ទេីយទមីលទៅឹមូវដូចោនទៅទលីទអង់។ ងទនទៅទទៀ នមលនមែរដលូវនអនុវតទោយបេ័នធ បិបតិរទេញនិយម នលខណៈទលអៀងេីរណ់របស់ “យូនីូដសតង់ោរ”។ បាទនយនលទធលអវិជជនទចីន ដូច៖ និភ័យន សុវថិេទោយររឆទនិងលងបនល លទធេនុងរសែងរនមិប និង េសុេែ ញនុងដំទណីររបបធមែិ។ រវវទន រយទីញនូវបាុងរអិនូដនមែរទោយ “យូនីូដសតង់ោរ”។ ីមន (Keyman) នដំទយចទបាងទន សប់មែរនិងទសងទទៀងដរ។

1.
Introduction

According to How to Type Khmer Unicode (Open Forum of Cambodia, 2004), the Khmer (Cambodian) language has 33 consonants, 14 independent vowels, 16 dependent vowels, and 13 diacritics. These are assigned individual code points in the Unicode Standard in the range U+1780 through U+17FF. However, given the highly complex structure of the Khmer writing system, implementers of the Unicode Standard have encountered some ambiguity in how words are constructed from those component code points, which could result in vulnerability to spoofing.

In this paper, 'spoof-vulnerable rendering' describes how incorrectly encoded clusters can be rendered in a manner that could easily be mistaken for correctly encoded clusters, whether identical pixel-by-pixel or subtly different.

This section illustrates the problematic cases that this paper explores. Each example shows a correct encoding, followed by incorrect encodings of the same word that render identically on common operating systems. The examples show the Unicode code points, sample output from Google Chrome 58.0 on Android 6.0.1, and the Google Search result count for each encoding. Unicode code points for Khmer characters always begin with "U+17", so only the last two digits are displayed in our examples.

Case #1: Subscript + Vowel
A word whose subscript and vowel are encoded in different orders looks exactly alike on-screen (see Table 1).

      Code points (U+17xx)    Search results
(1a)  81 D2 98 C2 9A          29M
(1b)  81 C2 D2 98 9A          175K
Table 1: ខ្មែរ 'Khmer'

Case #2: Subscript + [D2+9A]
Where a word contains two subscripts and one of them is [D2+9A], the two subscripts render identically in either order (see Table 2).

      Code points (U+17xx)    Search results
(2a)  9F D2 8F D2 9A B8       4790K
(2b)  9F D2 9A D2 8F B8       471K
Table 2: ស្ត្រី 'woman'

Case #3: Subscript + Consonant Shifter
Table 3 shows a consonant shifter placed before a subscript (3b) and after a subscript (3a), producing nearly identical display.

      Code points (U+17xx)    Search results
(3a)  98 D2 99 C9 B6 84       452K
(3b)  98 C9 D2 99 B6 84       464K
Table 3: ម្យ៉ាង 'one kind'

Case #4: Consonant Shifter + Vowel
Table 4 shows a word that could be encoded in four ways, with identical rendering results.
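
The comparisons in Cases #1 through #3 can be reproduced with a short script. The following Python sketch is illustrative only and is not part of the authors' tooling: the helper names `decode` and `compare` are hypothetical, and the encodings are copied from Tables 1 to 3. It expands the two-digit abbreviations into full U+17xx code points and checks whether the paired encodings are equal as strings, whether they contain the same code points in a different order, and whether Unicode NFC normalization makes them equal.

```python
import unicodedata

# The paper abbreviates each Khmer code point U+17xx to its last two hex digits.
KHMER_PREFIX = 0x1700


def decode(suffixes: str) -> str:
    """Expand a string such as '81 D2 98 C2 9A' into the Khmer text it encodes."""
    return "".join(chr(KHMER_PREFIX + int(s, 16)) for s in suffixes.split())


# Encodings copied from Tables 1 to 3: (a) is the encoding the paper lists as
# correct, (b) is the alternative ordering that renders alike.
CASES = {
    "Case #1": ("81 D2 98 C2 9A", "81 C2 D2 98 9A"),
    "Case #2": ("9F D2 8F D2 9A B8", "9F D2 9A D2 8F B8"),
    "Case #3": ("98 D2 99 C9 B6 84", "98 C9 D2 99 B6 84"),
}


def compare(a: str, b: str) -> dict:
    """Compare two abbreviated encodings at the code point level."""
    text_a, text_b = decode(a), decode(b)
    return {
        # Plain string comparison, as used by naive search or matching.
        "equal": text_a == text_b,
        # Same code points, merely in a different order?
        "same_code_points": sorted(text_a) == sorted(text_b),
        # Does canonical normalization (NFC) make the two strings equal?
        "equal_after_nfc": unicodedata.normalize("NFC", text_a)
        == unicodedata.normalize("NFC", text_b),
    }


if __name__ == "__main__":
    for label, (a, b) in CASES.items():
        print(label, compare(a, b))
```

For each of the three pairs, the sketch is expected to report that the strings differ, that they consist of the same code points, and that NFC normalization alone does not make them equal. This is consistent with the differing search result counts in the tables: software that compares code point sequences treats the paired encodings as different words even though they look alike on-screen.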