Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding

Samson Tan§♮, Shafiq Joty§‡, Lav R. Varshney℧§, Min-Yen Kan♮

§Salesforce AI Research   ♮National University of Singapore
‡Nanyang Technological University   ℧University of Illinois at Urbana-Champaign

§{samson.tan,sjoty}@salesforce.com   ♮kanmy@comp.nus.edu.sg   ℧varshney@illinois.edu

Abstract

Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training, and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.¹

1 Introduction

Large-scale neural models have proven successful at a wide range of natural language processing (NLP) tasks but are susceptible to amplifying discrimination against minority linguistic communities (Hovy and Spruit, 2016; Tan et al., 2020) due to selection bias in the training data and model overamplification (Shah et al., 2019). Most datasets implicitly assume a distribution of error-free Standard English speakers, but this does not accurately reflect the majority of the global English-speaking population, who are either second language (L2) or non-standard dialect speakers (Crystal, 2003; Eberhard et al., 2019). These World Englishes differ at the lexical, morphological, and syntactic levels (Kachru et al., 2009); sensitivity to these variations predisposes English NLP systems to discriminate against speakers of World Englishes by either misunderstanding or misinterpreting them (Hern, 2017; Tatman, 2017). Left unchecked, these biases could inadvertently propagate to future models via metrics built around pretrained models, such as BERTScore (Zhang et al., 2020).

[Figure 1: Base-Inflection Encoding reduces inflected words to their base forms, then reinjects the grammatical information into the sentence as inflection symbols.]

In particular, Tan et al. (2020) show that current question answering and machine translation systems are overly sensitive to non-standard inflections, a common feature of dialects such as Colloquial Singapore English (CSE) and African American Vernacular English (AAVE).² Since people naturally correct for or ignore non-standard inflection use (Foster and Wigglesworth, 2016), we should expect NLP systems to be equally robust.

¹ Code will be available at github.com/salesforce/bite.
² Examples in Appendix A.
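To make the transformation in Figure 1 concrete, the following is a minimal sketch of a BITE-style encoder. It assumes spaCy's small English model for lemmatization and uses Penn Treebank part-of-speech tags as the inflection symbols; both the symbol inventory and the choice of lemmatizer are illustrative assumptions, not necessarily the paper's exact implementation.

```python
# Minimal BITE-style encoder sketch. Assumes spaCy and its small English
# model are installed (python -m spacy download en_core_web_sm); the
# inflection-symbol inventory below is illustrative.
from typing import List

import spacy

nlp = spacy.load("en_core_web_sm")

# Penn Treebank tags marking inflected forms: plural nouns, tensed or
# participial verbs, and comparative/superlative adjectives and adverbs.
INFLECTED_TAGS = {"NNS", "VBD", "VBG", "VBN", "VBZ", "JJR", "JJS", "RBR", "RBS"}


def bite_encode(sentence: str) -> List[str]:
    """Reduce inflected words to their base forms, then reinject the
    grammatical information as inflection symbols."""
    tokens = []
    for tok in nlp(sentence):
        if tok.tag_ in INFLECTED_TAGS and tok.lemma_.lower() != tok.text.lower():
            tokens.append(tok.lemma_)       # base form
            tokens.append(f"[{tok.tag_}]")  # inflection symbol
        else:
            tokens.append(tok.text)         # uninflected token passes through
    return tokens


print(bite_encode("She walks two dogs."))
# -> ['She', 'walk', '[VBZ]', 'two', 'dog', '[NNS]', '.']
```

Because the base forms and inflection symbols remain ordinary tokens, the encoded sequence can be passed unchanged to any downstream subword tokenizer.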
Existing work on adversarial robustness for NLP primarily focuses on adversarial training methods (Belinkov and Bisk, 2018; Ribeiro et al., 2018; Tan et al., 2020) or on classifying and correcting adversarial examples (Zhou et al., 2019a). However, these approaches either effectively enlarge the training dataset by adding adversarial examples or require training a new model to identify and correct perturbations, significantly increasing the overall computational cost of creating robust models. They also operate only on the raw text or the model itself, ignoring tokenization, the operation that transforms raw text into a form that the neural network can learn from.

We introduce a tokenization method, Base-Inflection Encoding (BITE), that reduces inflected words to their base forms before reinjecting the grammatical information as special symbols.
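Since BITE intervenes at exactly this tokenization stage, a natural way to combine it with an existing pipeline is to register the inflection symbols as special tokens of a pretrained subword tokenizer so they are never split. The sketch below uses Hugging Face's BertTokenizer purely for illustration; the symbol names follow the encoder sketch above and are assumptions rather than the paper's exact setup.

```python
# Sketch: feeding BITE output to a pretrained subword tokenizer.
# Assumes the Hugging Face transformers package; symbol names are
# illustrative (see the earlier encoder sketch).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Register inflection symbols as atomic special tokens so the subword
# algorithm never splits them.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[NNS]", "[VBZ]", "[VBD]"]}
)
# When fine-tuning a model, its embedding matrix must be resized to match:
# model.resize_token_embeddings(len(tokenizer))

bite_tokens = ["She", "walk", "[VBZ]", "two", "dog", "[NNS]", "."]
print(tokenizer.tokenize(" ".join(bite_tokens)))
# -> ['she', 'walk', '[VBZ]', 'two', 'dog', '[NNS]', '.']
```

Because inflected surface forms collapse onto their more frequent base forms, fewer subword splits are needed, which is one intuition behind the vocabulary-efficiency gains described in the abstract.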