Quantifying the Semantic Core of Gender Systems

Adina Williams (@), Ryan Cotterell (S, H), Lawrence Wolf-Sonkin (S), Damián Blasi (P), Hanna Wallach (Z)

@ Facebook AI Research, New York, USA
S Department of Computer Science, Johns Hopkins University, Baltimore, USA
H Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
P Comparative Linguistics Department, University of Zürich, Switzerland
Z Microsoft Research, New York City, USA

adinawilliams@fb.com, rdc42@cam.ac.uk, lawrencews@jhu.edu, damian.blasi@uzh.ch, hanna@dirichlet.net

Abstract

Many of the world’s languages employ grammatical gender on the lexeme. For example, in Spanish, the word for house (casa) is feminine, whereas the word for paper (papel) is masculine. To a speaker of a genderless language, this assignment seems to exist with neither rhyme nor reason. But is the assignment of inanimate nouns to grammatical genders truly arbitrary? We present the first large-scale investigation of the arbitrariness of noun–gender assignments. To that end, we use canonical correlation analysis to correlate the grammatical gender of inanimate nouns with an externally grounded definition of their lexical semantics. We find that 18 languages exhibit a significant correlation between grammatical gender and lexical semantics.

1 Introduction

In his semi-autobiographic work about his time traveling through Germany, A Tramp Abroad, Twain (1880) recounted his difficulty when learning the German gender system: “Every noun has a gender, and there is no sense or system in the distribution; so the gender of each must be learned separately and by heart. In German, a young lady has no sex, while a turnip has.
Think what overwrought reverence that shows for the turnip, and what callous disrespect for the girl.” Although this humorous take on German grammatical gender is clearly a caricature, the quote highlights the fact that the relationship between the grammatical gender of nouns and their lexical semantics is often quite opaque.

As arbitrary as certain noun–gender assignments may appear overall, a relatively clear relationship often exists between grammatical gender and lexical semantics for some of the lexicon. The portion of the lexicon where this relationship is clear usually consists of animate nouns; nouns referring to people morphologically reflect the sociocultural notion of “natural genders.” This portion of the lexicon, the “semantic core,” seems to be present in all gendered languages (Aksenov, 1984; Corbett, 1991). But how many inanimate nouns can also be included in the semantic core? Answering this question requires investigating whether there is a correlation between grammatical gender and lexical semantics for inanimate nouns.

Our primary technical contribution is demonstrating that grammatical gender and lexical semantics can be correlated using canonical correlation analysis (CCA), a standard method for computing the correlation between two multivariate random variables. We consider 18 gendered languages, following three steps for each. First, we encode each inanimate noun as a one-hot vector representing the noun’s grammatical gender in that language. Second, we create five operationalizations of each noun’s lexical semantics using word embeddings in five genderless “donor” languages (English, Japanese, Korean, Mandarin Chinese, and Turkish). Finally, for each genderless language, we use CCA to compute the desired correlation between grammatical gender and lexical semantics.
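The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the function name `canonical_correlations`, the ridge term `reg`, and the toy embedding dimension are all assumptions made for illustration, and the shuffled-label comparison at the end is only an informal sanity check, not the significance test used in the paper.

```python
import numpy as np


def canonical_correlations(X, Y, reg=1e-6):
    """Canonical correlations between row-aligned matrices X (n x p) and Y (n x q).

    `reg` is a small ridge term (an assumption of this sketch) that keeps the
    covariance matrices invertible; centered one-hot columns are linearly
    dependent, so the gender covariance would otherwise be singular.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via the symmetric eigendecomposition.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    # Singular values of the whitened cross-covariance are the canonical correlations.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False)


# Synthetic stand-in data (hypothetical): 600 "nouns", 2 genders, 5-dim "embeddings".
rng = np.random.default_rng(0)
n_nouns, n_genders, dim = 600, 2, 5
genders = rng.integers(0, n_genders, size=n_nouns)

# Step 1: one-hot encoding of each noun's grammatical gender.
gender_one_hot = np.eye(n_genders)[genders]

# Step 2: stand-in for donor-language embeddings; one dimension is made to
# depend on gender so that there is a correlation to recover.
embeddings = rng.normal(size=(n_nouns, dim))
embeddings[:, 0] += 2.0 * genders

# Step 3: CCA between embeddings and gender; the largest value summarizes the link.
rho = canonical_correlations(embeddings, gender_one_hot)

# Informal check: shuffling gender labels should destroy most of the correlation.
shuffled = gender_one_hot[rng.permutation(n_nouns)]
rho_shuffled = canonical_correlations(embeddings, shuffled)
```

In this sketch, `rho[0]` plays the role of the single correlation value for one gendered–genderless language pair; with real data, `embeddings` would hold the donor-language word vectors for the translated nouns.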
This process yields a single value for each of the 90 gendered–genderless language pairs, revealing a significant correlation between grammatical gender and lexical semantics for 55 of these language pairs.

Secondarily, we investigate semantic similarities between the 18 languages’ gender systems, i.e., their assignments of nouns to grammatical genders. We analyze the projections of lexical semantics (operationalized as word embeddings in English) obtained via CCA, finding that phylogenetically

arXiv:1910.13497v1 [cs.CL] 29 Oct 2019