On the Romance Languages Mutual Intelligibility

Alina Maria Ciobanu, Liviu P. Dinu
Faculty of Mathematics and Computer Science, University of Bucharest
Center for Computational Linguistics, University of Bucharest
alina.ciobanu@my.fmi.unibuc.ro, ldinu@fmi.unibuc.ro

Abstract

We propose a method for computing the similarity of natural languages and for clustering them based on their lexical similarity. Our study provides evidence to be used in the investigation of written intelligibility, i.e., the ability of people writing in different languages to understand one another without prior knowledge of foreign languages. We account for etymons and cognates, quantify lexical similarity, and extend our analysis from words to languages. Based on the introduced methodology, we compute a matrix of Romance languages intelligibility.

Keywords: Romance languages, etymology, cognates, string similarity

1. Introduction and Related Work

Determining degrees of similarity between the world’s languages is an intensely debated issue (Lebart and Rajman, 2000), with many of the controversies in historical and comparative linguistics centered on language classification (McMahon and McMahon, 2003). Although the linguistic literature abounds in claims about the classification of natural languages, McMahon and McMahon (2003) argue that quantitative and computational methods need to be developed in this field. Methods for comparing languages are constantly developed and periodically reassessed (Ringe et al., 2002; Alekseyenko et al., 2012; Atkinson et al., 2005; Barbancon et al., 2013), and many of them have crossed disciplinary boundaries by borrowing computational tools from different fields (Bortolussi et al., 2011). Dyen et al. (1992) investigate the classification of Indo-European languages by applying a lexicostatistical method.
Campbell (2003) analyzes various approaches used over time for establishing relationships between languages, emphasizing the popularity of the comparative method. Barbancon et al. (2013) show that the difficulty in evaluating results on the reconstruction of phylogenetic trees resides in the variety of computational methods used and in the differences in datasets.

McMahon and McMahon (2003) point out that in many situations the similarity of natural languages is a fairly vague notion, with both linguists and non-linguists relying on intuitions about which languages are more similar to which others; in some cases, these judgments are based on the very subjective opinions of the authors. While the grouping of languages into linguistic families is generally accepted, the relationships between languages belonging to the same family are still controversial and are periodically investigated. Degrees of similarity between languages are far from certain; values vary considerably from one researcher to another, not only for exotic languages, but even for extensively studied languages, many of which are closely related.

According to Gooskens (2007), some genetically related languages are so close to each other that their speakers are able to communicate without prior instruction. Gooskens et al. (2008) analyze several phonetic and lexical predictors of intelligibility and, to determine the relevance of each linguistic level, correlate the intelligibility scores with lexical and phonetic distances. Their analysis leads to the conclusion that the two levels are to a large extent independent and that linguistic distances can successfully predict intelligibility between closely related languages.
Regarding lexical distances, they account for the number of non-cognates, arguing that these words are basically unintelligible to listeners without prior knowledge of the considered language and that intelligibility is inversely related to the number of non-cognates. The language intelligibility problem is also mentioned in the report published in 2007 at the European Commission by the High Level Group on Multilingualism (HLGM), which emphasizes “a lack of knowledge about mutual intelligibility between closely related languages in Europe and the lack of knowledge about the possibilities for communicating through receptive multilingualism, i.e., where speakers of closely related languages each speak their own language”. In today’s context of European multilingualism and massive population mobility, a deeper insight into this matter might have not only a theoretical, cultural, communicative, educational or scientific impact, but an economic or business impact as well.

In this paper we investigate the similarity of natural languages with respect to their written intelligibility, i.e., the ability of people writing in different languages to understand one another without prior knowledge of foreign languages. The written form of a language is found not only in literature, but in various other forms as well: movie subtitles, online news or communication networks (chats, for example). In a broadly accepted sense, a language L1 is closer to a language L2 when texts written in L2 are more easily understood by speakers of L1 without prior knowledge of L2. The reverse is also true. In other words, the higher the degree of intelligibility between two languages, the closer they are.

1.1. Our Approach

Although there are multiple aspects that are relevant in the study of language relatedness, such as orthographic, phonetic, syntactic and semantic differences, in this paper we focus only on lexical similarity. The orthographic ap-