Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), pages 381-387, December 8, 2022. ©2022 Association for Computational Linguistics

Beyond Arabic: Software for Perso-Arabic Script Manipulation

Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat
Google Research
United Kingdom, India, United States, Japan
{agutkin,cibu,raiomond,roark,rws}@google.com

Abstract

This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations cover various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for eleven contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora rather than Arabic itself, this approach could be adopted for any language that uses the Arabic script, thus providing a unified framework for treating a script family used by close to a billion people.

1 Introduction

While originally developed for recording Arabic, the Perso-Arabic script has gradually become one of the most widely used modern scripts. Throughout history the script was adapted to record many languages from diverse language families, with scores of adaptations still active today.
This flexibility is partly due to the core features of the script itself, which over time evolved from a purely consonantal script to include a productive system of diacritics for representing long vowels and optional marking of short vowels and phonological processes such as gemination (Bauer, 1996; Kurzon, 2013). Consequently, many languages productively evolved their own adaptation of the Perso-Arabic script to better suit their phonology by not only augmenting the set of diacritics but also introducing new consonant shapes.

This paper presents an open-source software library designed to deal with the ambiguities and inconsistencies that result from representing various regional Perso-Arabic adaptations in digital media. Some of these issues are due to the Unicode standard itself, where a Perso-Arabic character can often be represented in more than one way (Unicode Consortium, 2021). Others are due to the lack or inadequacy of input methods and the instability of modern orthographies for the languages in question (Aazim et al., 2009; Liljegren, 2018). Such issues percolate through the data available online, such as Wikipedia and Common Crawl (Patel, 2020), negatively impacting the quality of NLP models built with such data. The script normalization software described below goes beyond the standard language-agnostic Unicode approach for Perso-Arabic to help alleviate some of these issues.

The library design is inspired by and consistent with prior work by Johny et al. (2021), introduced in §2, who provided a suite of finite-state grammars for various normalization and (reversible) romanization operations for the Brahmic family of scripts.¹ While the Perso-Arabic script and the respective set of regional orthographies we support – Balochi, Kashmiri, Kurdish (Sorani), Malay (Jawi), Pashto, Persian, Punjabi (Shahmukhi), Sindhi, South Azerbaijani, Urdu and Uyghur – differ significantly from those Brahmic scripts, we pursue a similar finite-state interpretation,² as described in §3. Implementation details and simple validation are provided in §4.

Author note: On contract from Optimum Solutions, Inc.
¹ https://github.com/google-research/nisaba
² https://github.com/google-research/nisaba/tree/main/nisaba/scripts/abjad_alphabet
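
The introduction's remark that a Perso-Arabic character can often be represented in more than one way in Unicode, and that language-agnostic normalization does not resolve every such case, can be illustrated with a minimal sketch. It uses only Python's standard `unicodedata` module and does not use the library described in this paper. The `PERSIAN_REWRITES` table and the `normalize_persian` function are hypothetical names introduced here for illustration; the library's actual rules are expressed as finite-state grammars and may differ.

```python
import unicodedata


def nfc(text: str) -> str:
    """Apply the standard Unicode NFC normalization form."""
    return unicodedata.normalize("NFC", text)


# Two canonically equivalent encodings of "alef with madda above".
precomposed = "\u0622"        # ARABIC LETTER ALEF WITH MADDA ABOVE
decomposed = "\u0627\u0653"   # ARABIC LETTER ALEF + ARABIC MADDAH ABOVE

# NFC maps both encodings to the same code point sequence.
same_after_nfc = nfc(precomposed) == nfc(decomposed)

# Arabic yeh and Farsi yeh look alike in initial and medial positions,
# but they are not canonically equivalent, so NFC keeps them distinct.
arabic_yeh = "\u064A"         # ARABIC LETTER YEH
farsi_yeh = "\u06CC"          # ARABIC LETTER FARSI YEH
still_distinct = nfc(arabic_yeh) != nfc(farsi_yeh)

# Hypothetical language-specific rewrite table (an assumption made for
# this sketch, not the library's API): in Persian text, replace Arabic
# yeh and kaf with the letters used by Persian orthography.
PERSIAN_REWRITES = {
    "\u064A": "\u06CC",       # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",       # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}


def normalize_persian(text: str) -> str:
    """Run NFC first, then apply the illustrative Persian rewrites."""
    return nfc(text).translate(str.maketrans(PERSIAN_REWRITES))
```

In this sketch the language-agnostic step (NFC) runs first and the language-specific rewrites run afterwards, which mirrors the distinction drawn above between standard Unicode normalization and normalization that depends on a regional orthography.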