Towards unsupervised extraction of linguistic typological features from language descriptions

Søren Wichmann
Leiden University
wichmannsoeren@gmail.com

Taraka Rama
University of Oslo
tarakaramadaiict@gmail.com

1 Introduction

Manual encoding of typological databases is a tiresome procedure that takes large amounts of time. Bender (2016) reviews recent efforts in extracting typological features from interlinear glossed text (Lewis and Xia, 2010), Bible corpora (Östling, 2015; Malaviya et al., 2017), and sources such as morphologically annotated resources and treebanks (Bjerva and Augenstein, 2018).

However, there is a lack of publications describing the application of NLP techniques to extract typological features directly from language descriptions contained in grammar books, dissertations, and linguistics articles. Collections of such descriptive sources are accumulating as PDFs (including many from scans) that have subsequently been OCR'ed. In this paper, we describe our first attempt at building an NLP pipeline that extracts typological features from OCR'ed linguistic descriptions.

2 General approach

Our approach to extracting the features of WALS (Dryer and Haspelmath, 2013) from the OCR'ed texts consists of two steps. First, we detect the part of a text that is most likely to contain a description of a given WALS feature. Next, we solve the classification problem of extracting exactly one feature value from the target text chunk. That is, unseen chunks hypothesized to discuss a given WALS feature are matched with the general patterns associated with a specific feature value found in the training set. We use a training set of 10,000 feature-value-source combinations.

3 Identifying text chunks containing feature descriptions

Initial preprocessing included cleaning the texts of noisy content. The relevant online WALS chapters were parsed and each put into a text file.
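To illustrate how a cleaned description might be segmented into the successive chunks searched in the next step, here is a minimal sketch. The tokenization and the chunk size are our assumptions for illustration; the paper does not specify them:

```python
import re

def split_into_chunks(text, chunk_size=200):
    """Split a cleaned description into successive chunks of roughly
    chunk_size whitespace-delimited tokens (chunk_size is an assumed
    parameter, not taken from the paper)."""
    tokens = re.findall(r"\S+", text)
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]
```

Each resulting chunk can then be scored against a WALS chapter, and the best-matching region of the description passed on to the classification step.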
Five different off-the-shelf keyword extraction methods were run on the WALS chapters. Their outputs are, respectively:

1. POS tags, yielding nouns and their frequencies
2. Collocations and co-occurrences and their frequencies
3. Keywords and their ranks using the TextRank algorithm
4. Keywords and their frequencies using rapid automatic keyword extraction (RAKE)
5. Noun and verb phrases and their frequencies

All our experiments involving keywords were performed using the R binding for the UDPipe package (Straka and Straková, 2017).[1]

For each combination of feature and language description, the five above-mentioned keyword-extraction methods were applied to successive windows of 5 chunks of the description in order to find the combination of keyword and vector method most adequate for identifying a text chunk as discussing a particular WALS feature. For each window and keyword method, the distance to the WALS chapters was measured using eight different standard vector distances (or similarities converted to distances): Chebyshev, correlation, cosine, Euclidean, Jaccard, Jensen-Shannon, Manhattan, and Soergel, in addition to a new one, called pJaccard.

[1] https://www.r-bloggers.com/an-overview-of-keyword-extraction-techniques/
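To make the distance step concrete, the following sketch computes two of the listed measures, cosine and a weighted Jaccard, over keyword-frequency vectors aligned on a shared vocabulary. The vector construction and the weighted (Ružička-style) Jaccard variant are our assumptions for illustration; pJaccard is not reproduced here, since it is not defined in this excerpt:

```python
import math

def keyword_vector(counts, vocab):
    # align a keyword -> frequency dict to a shared, ordered vocabulary
    return [counts.get(w, 0) for w in vocab]

def cosine_distance(u, v):
    # 1 - cosine similarity; assumes neither vector is all zeros
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def jaccard_distance(u, v):
    # weighted Jaccard distance on nonnegative frequency vectors
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return 1.0 - num / den

def best_window(window_counts, chapter_counts, vocab, dist=cosine_distance):
    # index of the window whose keyword vector is closest to the WALS chapter
    chapter_vec = keyword_vector(chapter_counts, vocab)
    return min(range(len(window_counts)),
               key=lambda i: dist(keyword_vector(window_counts[i], vocab),
                                  chapter_vec))
```

The window minimizing the chosen distance to a chapter is then hypothesized to be the part of the description discussing that chapter's feature.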