Data-Driven Approach to Identification of Latin Phrases in Russian Web-Crawled Corpora

Vladimír Benko 1,2 and Katarína Rausová 1

1 Slovak Academy of Sciences, Ľ. Štúr Institute of Linguistics, Bratislava, Slovakia
2 Comenius University in Bratislava, UNESCO Chair in Plurilingual and Multicultural Communication, Bratislava, Slovakia
{vladimir.benko,katarina.rausova}@juls.savba.sk

Abstract

Latin phrases are an integral part of the language of educated speakers in many (European) languages. Besides calques and lexical units of Latin origin that have already been adapted to the orthography of the respective host language, phrases retaining the original form and orthography can also be found in many texts. Due to the rather low frequency of the phenomenon, however, any systematic attempt at its analysis was a real challenge before the advent of very large (multi-Gigaword) corpora. Our paper presents a method for the semi-automatic detection of Latin phrases in a Russian web corpus, based on applying a Latin tagger and a series of filtering steps performed with standard Linux utilities. A preliminary analysis of the resulting candidate list is presented in the concluding part of the paper.

Keywords: Latin Quotations, Code Switching, Corpus-Driven Approach

Reference for citation: Benko V., Rausová K. Data-Driven Approach to Identification of Latin Phrases in Russian Web-Crawled Corpora // Computer Linguistics and Computing Ontologies. Vol. 4 (Proceedings of the XXIII International Joint Scientific Conference «Internet and Modern Society», IMS-2020, St. Petersburg, June 17-20, 2020). - St. Petersburg: ITMO University, 2020. P. 11 – 20. DOI: 10.17586/0000-0000-2020-4-11-20

Более того, здесь есть своя "старуха ex machina" Антонида Васильевна, внезапно возвращающаяся с порога смерти, меняющая расклад в семействе Загорянских и заражающая главного героя учителя Алексея игорной страстью [1].
(Moreover, it has its own "old woman ex machina", Antonida Vasilyevna, who suddenly returns from death's door, changes the balance of power in the Zagoryansky family, and infects the protagonist, the tutor Alexei, with a passion for gambling [1].)
Introduction

The amount of lexical evidence for low-frequency lexical items, such as idioms and other types of fixed expressions, could hardly be considered sufficient, not only in pre-corpus times but also during the early decades of corpus linguistics. Linguistic analysis and lexicographic treatment of such phenomena had to be based on a rather small number of examples found in collections of citation slips, or often on hapax occurrences in first-generation corpora. Even with a 100 Megaword corpus at hand, only a corpus-based methodology could be applied, i.e. attesting occurrences of the “suspected” phrases based on lists of them found in legacy lexicographic works.

With the advent of the “big data” paradigm in corpus linguistics in the form of multi-Gigaword corpora, as well as with the availability of robust tools for their linguistic annotation, the situation gradually began to change. Russian is also among the languages for which corpora of this class are available, such as ruTenTen [2], GICR [3, 4], Araneum Russicum [5], Taiga [6], or Omnia Russica [7]. Having such resources at hand, linguists are not only capable of finding many more

Компьютерная лингвистика и вычислительные онтологии. Вып. 4. 2020 11