A Vector Space Model for Syntactic Distances Between Dialects

Emanuele Di Buccio, Giorgio Maria Di Nunzio, Gianmaria Silvello
Department of Information Engineering, University of Padua
{dibuccio, dinunzio, silvello}@dei.unipd.it

Abstract

Syntactic comparison across languages is essential in the research field of linguistics, e.g. when investigating the relationship among closely related languages. In IR and NLP, syntactic information is used to understand the meaning of word occurrences according to the context in which they appear. In this paper, we discuss a mathematical framework to compute the distance between languages based on the data available in current state-of-the-art linguistic databases. This framework is inspired by approaches presented in IR and NLP.

Keywords: Digital Geolinguistics, Language Distance, Vector Space Model

1. Motivation and Background

Syntactic comparison across languages is essential in the research field of linguistics. In fact, the study of closely related varieties has proven to be extremely useful in finding relations between cross-linguistic syntactic differences that might otherwise appear unrelated, and in analysing the linguistic structures in the task of historical reconstruction (Nerbonne and Wiersma, 2006; Colonna et al., 2010). More precisely, the study of syntactic variation examines the ways in which linguistic elements, i.e. words and clitics, are put together to form constituents, that is, phrases or clauses. In this context, the analysis of dialectal variation patterns may result in more fine-grained linguistic theories, and empirical dialect data may also help improve the validation process of linguistic theories. Therefore, dialectal variation research may contribute to a better understanding of the inner workings of the human language system (Spruit, 2008).

Different dialectal variants do not occur randomly across the territory, and geographical patterns of variation are recognizable for an individual syntactic form.
In other words, the distribution of an individual syntactic phenomenon is often geographically coherent to a certain extent. This indicates that there might be a relationship between syntactic variation and geographical distance. However, when several distribution patterns of syntactic phenomena are combined for joint analysis, the interpretation of geographical distributions is less clear (Spruit, 2008).

In the literature, several approaches for measuring the degree of syntactic difference between varieties have been proposed. The techniques are quantitative by nature, which means that the linguistic data are represented and compared numerically using a function which measures the distance between two points (the varieties). Many of the works in this research field use the Hamming distance (Hamming, 1950) to measure the differences between two or more varieties (Nerbonne and Wiersma, 2006; Spruit, 2008; Spruit, 2006; Spruit et al., 2009). The Hamming distance is calculated between each pair of dialects to obtain a measurement based on binary comparisons between feature variants: the distance is increased by 1 for each feature that is observed in one dialect but not in the other. In (Nerbonne and Wiersma, 2006), instead of binary features, the authors use frequency profiles of trigrams of part-of-speech (POS) categories as indicators of syntactic differences.

Since the number of features can be very high, a reduction of the space is usually performed by means of multidimensional scaling (MDS). MDS is applied to analyse the dialect relationships in the distance matrix. The goal of this procedure in this context is to optimally represent the most differentiating feature variants for each dialect in relation to all other dialects. The results of this reduction to a visible (two- or three-dimensional) space are visualised with dialect colour maps (Spruit, 2008).
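
The pairwise Hamming comparison described above can be illustrated with a minimal sketch. The dialect names, the binary feature vectors, and the helper names `hamming_distance` and `distance_matrix` are hypothetical and chosen only for illustration; they are not taken from the cited works or from any linguistic database.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Count the features observed in one dialect but not in the other."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_matrix(dialects):
    """Build a symmetric matrix of pairwise Hamming distances.

    `dialects` maps a dialect name to a list of 0/1 values, one per
    syntactic feature (1 = feature observed, 0 = not observed).
    """
    names = sorted(dialects)
    matrix = {p: {q: 0 for q in names} for p in names}
    for p, q in combinations(names, 2):
        d = hamming_distance(dialects[p], dialects[q])
        matrix[p][q] = d
        matrix[q][p] = d
    return matrix


# Hypothetical toy data: three dialects described by five binary features.
dialects = {
    "A": [1, 0, 1, 1, 0],
    "B": [1, 1, 1, 0, 0],
    "C": [0, 0, 1, 1, 1],
}

matrix = distance_matrix(dialects)
for name in sorted(matrix):
    print(name, [matrix[name][other] for other in sorted(matrix)])
```

A matrix of this kind is the input on which a dimensionality reduction such as MDS would then operate; that step is not shown here.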
For example, in (Spruit et al., 2009), each dialect's distance relationships to all other dialects are reduced to coordinates in a three-dimensional space using the three most important dimensions arising from the MDS analysis. These coordinates optimally represent the original dialect distance relationships. However, they no longer correspond directly to actual dialect distances.

More recent approaches try to identify correspondences between languages which are significant against chance and thus call for historical explanation. The computation of the probability of 'mutation' of one language into another is based on the application of genetic algorithms. In genetic algorithms, the basic idea is to cluster the population into a number of groups, based on their similarity with respect to a distance metric (Nguyen et al., 2012). A similar approach is discussed in (Colonna et al., 2010), where the Parametric Comparison Method (PCM) is presented. PCM is a new method of language comparison based on the idea that the core grammar of any natural language can in principle be represented by a string of binary symbols, each symbol coding the value of a linguistic parameter. Such strings of symbols can be unambiguously collated, and language distances and the chance probability of agreements can be precisely measured. This approach starts by computing the distance between two varieties as a Jaccard distance (Jaccard, 1901); then, to graphically represent the genetic similarities between populations, MDS is used to project distance matrices into a two-dimensional space so that the distances between the points approximate the respective degree of dissimilarity.

2. A Vector Space for Languages

Following the work of (Spruit, 2006), the term variable (tag) is central to this work. Generally speaking, a variable may be defined as a linguistic unit in which two language varieties can vary.
We define a syntactic variable as a form or word order in a syntactic context where two di-