Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), pages 381 - 387
December 8, 2022 ©2022 Association for Computational Linguistics
Beyond Arabic: Software for Perso-Arabic Script Manipulation
Alexander Gutkin†, Cibu Johny†, Raiomond Doctor‡∗, Brian Roark◦, Richard Sproat⊛
Google Research
†United Kingdom  ‡India  ◦United States  ⊛Japan
{agutkin,cibu,raiomond,roark,rws}@google.com
Abstract
This paper presents an open-source software library that provides a set of
finite-state transducer (FST) components and corresponding utilities for
manipulating the writing systems of languages that use the Perso-Arabic
script. The operations include various levels of script normalization,
including visual invariance-preserving operations that subsume and go beyond
the standard Unicode normalization forms, as well as transformations that
modify the visual appearance of characters in accordance with the regional
orthographies for eleven contemporary languages from diverse language
families. The library also provides simple FST-based romanization and
transliteration. We additionally attempt to formalize the typology of
Perso-Arabic characters by providing one-to-many mappings from Unicode code
points to the languages that use them. While our work focuses on the Arabic
script diaspora rather than Arabic itself, this approach could be adopted for
any language that uses the Arabic script, thus providing a unified framework
for treating a script family used by close to a billion people.
1 Introduction
While originally developed for recording Arabic, the Perso-Arabic script has
gradually become one of the most widely used modern scripts. Throughout
history the script was adapted to record many languages from diverse language
families, with scores of adaptations still active today. This flexibility is
partly due to the core features of the script itself, which over time evolved
from a purely consonantal script to include a productive system of diacritics
for representing long vowels and optional marking of short vowels and
phonological processes such as gemination (Bauer, 1996; Kurzon, 2013).
Consequently, many languages productively evolved their own adaptation of the
Perso-Arabic script to better suit their phonology by not only augmenting the
set of diacritics but also introducing new consonant shapes.
∗ On contract from Optimum Solutions, Inc.
This paper presents an open-source software library designed to deal with the
ambiguities and inconsistencies that result from representing various
regional Perso-Arabic adaptations in digital media. Some of these issues are
due to the Unicode standard itself, where a Perso-Arabic character can often
be represented in more than one way (Unicode Consortium, 2021). Others are
due to the lack or inadequacy of input methods and the instability of modern
orthographies for the languages in question (Aazim et al., 2009; Liljegren,
2018). Such issues percolate through the data available online, such as
Wikipedia and Common Crawl (Patel, 2020), negatively impacting the quality of
NLP models built with such data. The script normalization software described
below goes beyond the standard language-agnostic Unicode approach for
Perso-Arabic to help alleviate some of these issues.
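As a minimal illustration of the two kinds of ambiguity mentioned above, the following Python sketch uses only the standard `unicodedata` module. It first shows a case that standard Unicode normalization does resolve (precomposed versus decomposed ALEF WITH MADDA ABOVE), and then a case it does not (ARABIC LETTER YEH versus ARABIC LETTER FARSI YEH, which look alike in some positional forms but are distinct code points). The `PERSIAN_REWRITES` table and the `normalize_for_language` helper are hypothetical names introduced here for illustration only; they are not the library's API or data, and a context-free substitution like this is a simplification of what a finite-state grammar could express.

```python
import unicodedata

# Two encodings of the same letter: precomposed U+0622 versus
# base letter U+0627 followed by combining U+0653 (MADDAH ABOVE).
precomposed = "\u0622"
decomposed = "\u0627\u0653"


def nfc(text: str) -> str:
    """Applies the standard, language-agnostic Unicode NFC normalization."""
    return unicodedata.normalize("NFC", text)


# Hypothetical per-language rewrite table (an assumption for illustration,
# not taken from the library): code points often treated as interchangeable
# in Persian text, but which NFC leaves distinct.
PERSIAN_REWRITES = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}


def normalize_for_language(text: str, rewrites: dict) -> str:
    """Runs NFC first, then applies single-character language-specific rewrites."""
    return nfc(text).translate(str.maketrans(rewrites))


if __name__ == "__main__":
    # NFC unifies the canonically equivalent encodings.
    print(nfc(decomposed) == precomposed)
    # NFC does not unify the two yeh letters.
    print(nfc("\u064A") == nfc("\u06CC"))
    # The language-specific step does.
    print(normalize_for_language("\u0643\u064A", PERSIAN_REWRITES) == "\u06A9\u06CC")
```

The point of the sketch is the ordering: the language-agnostic step runs first, and language-specific rewrites are layered on top of it, which is one plausible reading of "going beyond" the standard Unicode approach.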
The library design is inspired by and consistent with prior work by Johny et
al. (2021), introduced in §2, who provided a suite of finite-state grammars
for various normalization and (reversible) romanization operations for the
Brahmic family of scripts.¹ While the Perso-Arabic script and the respective
set of regional orthographies we support – Balochi, Kashmiri, Kurdish
(Sorani), Malay (Jawi), Pashto, Persian, Punjabi (Shahmukhi), Sindhi, South
Azerbaijani, Urdu and Uyghur – are significantly different from those Brahmic
scripts, we pursue a similar finite-state interpretation,² as described in
§3. Implementation details and simple validation are provided in §4.
¹ https://github.com/google-research/nisaba
² https://github.com/google-research/nisaba/tree/main/nisaba/scripts/abjad_alphabet