High-Definition Phonotactics Reflect Linguistic Pasts

Jayden L. Macklin-Cordes and Erich R. Round
Ancient Language Lab, School of Languages and Cultures
University of Queensland, Brisbane, Australia
j.macklincordes@uq.edu.au | e.round@uq.edu.au

Abstract—Typological datasets for quantitative historical-linguistic inquiry are growing in breadth, but a challenge is also to increase their depth, since advanced methods often ideally require many hundreds of traits per language. Using biphone transition probabilities from phonemicized vocabulary data, we extract several hundred high-definition phonotactic traits per language, for 17 languages in the Ngumpin-Yapa and Yolngu subgroups of the Pama-Nyungan family, Australia. We detect phylogenetic signal at a significant level (p < 0.001 for both subgroups), measured against a reference phylogeny inferred from basic vocabulary cognacy data. This contrasts with simpler, binary coding of biphones’ occurrence, which provides insufficient detail for the detection of phylogenetic signal. Thus, we demonstrate the viability of a new method in quantitative historical linguistics, and emphasize the inferential power to be harnessed from high-definition, trait-rich datasets for comparative research.

Keywords—Historical linguistics, Phonology, Phonotactics, Phylogenetic signal, Pama-Nyungan, Ngumpin-Yapa, Yolngu.

I. INTRODUCTION

A. Richer data; more traits per language

Quantitative datasets are increasingly available which span large numbers of languages, yet sophisticated statistical methods often demand high numbers of traits. We investigate the potential of extracting many hundreds of phonotactic traits per language, from phonemicized vocabularies, and test these traits for phylogenetic signal. To set the bar high, we test our method on two language families of Australia. Australian languages are known for the homogeneity of their phonological systems [1].
This ought to provide a barrier to the recovery of phylogenetic signal, and thus, if our methods succeed with this data, we may be optimistic about wider applicability.

B. Phonotactic traits

All languages permit certain, but not other, sequences of their phonemes. Taking the most basic case, languages may be compared in terms of which two-segment sequences, a+b, they permit. For a set of phonemes in a language {p₁ … pₙ} this yields an n×n matrix of binary ‘biphone permissibility’ traits. Such data is often provided in descriptive grammars, or can be extracted from phonemicized vocabularies. However, permissibility data is rather coarse. Higher-definition data can be obtained from facts of frequency. For example, the (forward) Markov chain transition probability of a+b can be calculated as the frequency of occurrence of a+b relative to the frequency of all sequences a+X in a vocabulary [2], [3]. This yields an n×n matrix of continuous traits.

C. Trait inheritance in language change

Phonotactic data may offer particular insight into vertical inheritance, since when languages borrow lexicon or coin new lexical items, the incoming items are most often fitted into existing phonotactic patterns [4], allowing those patterns to persist even under conditions of borrowing and innovation.

D. Homogeneity in Australian phonological systems

Australian languages display a conspicuously low level of phonological diversity, even across distinct language families and in the midst of considerable variation in other linguistic categories [1], [5]–[8]. Common characteristics of Australian phoneme inventories include:
• 4–6 places of articulation: labial; velar; 2–4 coronal.
• 1 series of stops, with no voicing or length contrast.
• No contrastive fricatives.
• Nasals at every place of articulation.
• 1–4 laterals.
• A triangular system of vowel qualities.
A ‘typical’ Australian inventory is depicted in Table I.
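As a rough illustration of the two trait types defined in Section I-B, the following Python sketch derives a binary permissibility matrix and a forward transition probability matrix from a phonemicized vocabulary. The three-word vocabulary, the function names, and the choice to ignore word boundaries are assumptions made for illustration only; they are not drawn from the data or code of this study.

```python
from collections import Counter


def biphone_counts(words):
    """Count adjacent phoneme pairs (a, b) over a vocabulary.

    Each word is a sequence of phoneme symbols. Word boundaries are not
    treated as segments here, which is a simplifying assumption.
    """
    counts = Counter()
    for word in words:
        counts.update(zip(word, word[1:]))
    return counts


def trait_matrices(words):
    """Return (inventory, binary, prob) for a vocabulary.

    binary[(a, b)] is 1 if the sequence a+b is attested, else 0.
    prob[(a, b)] is count(a+b) / count(a+X), the forward transition
    probability; it is set to 0.0 when a never precedes any segment.
    """
    inventory = sorted({p for word in words for p in word})
    counts = biphone_counts(words)

    # Total number of biphones that begin with each phoneme a (all a+X).
    row_totals = Counter()
    for (a, _b), c in counts.items():
        row_totals[a] += c

    binary, prob = {}, {}
    for a in inventory:
        for b in inventory:
            c = counts[(a, b)]
            binary[(a, b)] = int(c > 0)
            prob[(a, b)] = c / row_totals[a] if row_totals[a] else 0.0
    return inventory, binary, prob


# Hypothetical toy vocabulary: each word is a tuple of phoneme symbols.
toy_words = [
    ("p", "a", "t", "a"),
    ("p", "a", "p", "a"),
    ("t", "a", "p", "u"),
]

inventory, binary, prob = trait_matrices(toy_words)
print(inventory)
print(binary[("p", "a")], prob[("p", "a")])
print(binary[("p", "u")], prob[("p", "u")])
```

In this toy case both p+a and p+u receive the same binary value of 1, whereas their transition probabilities differ (0.75 versus 0.25), which is the kind of detail that the coarser permissibility coding discards.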
Permissible phonotactic sequences in Australian languages are also highly constrained and similar across the continent [1]. Nevertheless, Gasser & Bowern [9] recently demonstrated that higher-definition frequency data may reveal variation that is not apparent in binary permissibility data. One contribution of the present study is the first quantification of the difference in phylogenetic signal between coarse permissibility data and richer frequency data.

TABLE I. ‘TYPICAL’ AUSTRALIAN INVENTORY (AFTER [11, P. 141])

           Peripheral                Apical                            Laminal
           Bilabial   Dorso-velar    Apico-alveolar  Apico-retroflex   Lamino-dental  Lamino-palatal
Stop       p          k              t               ʈ                 t̪              c
Nasal      m          ŋ              n               ɳ                 n̪              ɲ
Lateral                              l               ɭ                 l̪              ʎ
Trill                                r
Glide      w                                         ɹ                                j

           Front      Back
High       i, iː      u, uː
Low             a, aː

This research has been supported by ARC grant DE150101024 to E. Round and NSF grant 1423711 to C. Bowern.

II. LANGUAGE DATA

We study 17 languages in two subgroups of the large Pama-Nyungan family: Ngumpin-Yapa [10], [11], which stretches across central Australia, and Yolngu [12], located