Geography of Russian poetry: countries and cities inside the poetic world

Elizaveta Kuzmenko, Boris Orekhov
eakuzmenko_2@edu.hse.ru, borekhov@hse.ru
National Research University Higher School of Economics

Our paper addresses two major problems: a digital one and a humanities one. The digital problem is the automatic extraction of named entities; the humanities problem concerns the use of toponyms in poetic texts. Correspondingly, our research comprises two parts: automatic processing of a large body of texts from the corpus of Russian poetry, and identifying major trends in the functioning of toponyms over the history of Russian poetry from the XVIII to the XX century.

Our research is based on data from the poetic corpus, which is a part of the Russian National Corpus¹. The corpus includes the principal texts of Russian poetry from all periods of its history up to the XX century. The principles of text selection in the poetic corpus are described by its creators (Grishina et al. 2009; http://ruscorpora.ru/search-poetic.html). The size of the corpus is approximately 11 million word tokens.

Until now, studies of toponyms in Russian poetry have described a particular toponym from the perspective of an isolated text or a single author (see, for example, Mednis 1999). Our approach is quite different: we describe the geography of Russian poetry as a whole, in keeping with the framework of distant reading (Moretti 2005, Moretti 2013). The result thus demonstrates global trends in the usage of toponyms in Russian poetry as a system.

We used two different technologies to extract geographic entities from poetic texts, and the comparison of these two approaches is one of the results of our research. The first technology is the proprietary commercial software Textocat², which is based on machine learning and uses nonfictional texts as a training sample.
The creators of this software claim an F1-measure of 0.75 for the retrieval of named entities. However, performance is expected to be much lower on poetic texts, because the language of poetry differs radically from the language of prose.

The second approach is a tool of our own for the extraction of toponyms, based on a dictionary of geographical names. We had to create such a tool because there is no open-source software for the extraction of toponyms for Russian. As a basis for our dictionary of geographical names, we use the list of toponyms from Wikipedia.

We compared the figures obtained with our approach to those produced by Textocat, using a sample of toponyms consisting of countries and cities for evaluation. The comparison showed that Textocat retrieves only 25.7% of the country names and 19.3% of the city names found by our tool. In addition, Textocat makes many mistakes: for example, the locative pronouns там 'there' and где 'where' are retrieved as geographical entities, and the words страна 'country' and город 'city' are also included in its list of toponyms. The dictionary-based approach thus proves more efficient, and the results below rely only on the data extracted with this method.

First, we look in detail at the names of countries extracted from the poetic texts. The distribution of country mentions is presented in Table 1 (the six most frequently mentioned countries are shown):

¹ http://ruscorpora.ru/en/
² http://textocat.ru/
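
As a rough illustration of the dictionary-based approach and of the comparison with a second system described above, the following Python sketch matches tokens against a small gazetteer and computes how much of one system's output another system reproduces. It is not the tool used in this study. The gazetteer contents, the names `GAZETTEER`, `extract_toponyms` and `relative_coverage`, and the choice to count mentions rather than distinct names are all assumptions made for the example. Inflected forms are listed by hand here, whereas a realistic tool would more likely rely on lemmatization.

```python
import re
from collections import Counter

# Hypothetical miniature gazetteer: inflected surface form -> canonical toponym.
# A real dictionary (for instance, one built from Wikipedia lists) would be far
# larger, and inflection would normally be handled by a lemmatizer.
GAZETTEER = {
    "россия": "Россия", "россии": "Россия", "россию": "Россия",
    "москва": "Москва", "москвы": "Москва", "москве": "Москва",
    "париж": "Париж", "парижа": "Париж", "париже": "Париж",
}

# Cyrillic word tokens, optionally hyphenated.
TOKEN_RE = re.compile(r"[А-Яа-яЁё]+(?:-[А-Яа-яЁё]+)*")


def extract_toponyms(text, gazetteer=GAZETTEER):
    """Count gazetteer hits in the text, keyed by canonical toponym."""
    counts = Counter()
    for token in TOKEN_RE.findall(text):
        canonical = gazetteer.get(token.lower())
        if canonical is not None:
            counts[canonical] += 1
    return counts


def relative_coverage(other_counts, reference_counts):
    """Percentage of reference mentions that the other system also retrieved.

    Assumption: coverage is measured over mentions, not over distinct names.
    """
    total = sum(reference_counts.values())
    if total == 0:
        return 0.0
    shared = sum(
        min(other_counts.get(name, 0), n)
        for name, n in reference_counts.items()
    )
    return 100.0 * shared / total


if __name__ == "__main__":
    sample = "В Москве снег, а в Париже дождь. Москва спит."
    reference = extract_toponyms(sample)
    # Invented output of a second system that found only one mention.
    other = Counter({"Москва": 1})
    print(reference)
    print(round(relative_coverage(other, reference), 1))
```

A plain lookup of this kind cannot confuse function words such as там or где with place names, because only dictionary entries are ever returned. It will, however, miss any toponym or inflected form that is absent from the dictionary.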