Sociolinguistic Extension of the ORD Corpus of Russian Everyday Speech

Natalia Bogdanova-Beglarian, Tatiana Sherstinova (✉), Olga Blinova, Olga Ermolova, Ekaterina Baeva, Gregory Martynenko, and Anastasia Ryko

Saint Petersburg State University, 7/9 Universitetskaya nab., St. Petersburg 199034, Russia
{n.bogdanova,t.sherstinova,o.blinova,g.martynenko}@spbu.ru, {o-ermolova,aryko}@mail.ru, ekaterinabaeva@yahoo.com

Abstract. The ORD corpus is one of the largest resources of contemporary spoken Russian. By 2014, its collection numbered about 400 h of recordings made by a group of 40 respondents (20 men and 20 women, of different ages and professions), who volunteered to spend a whole day with a switched-on voice recorder, recording all their verbal communication. The corpus presents unique linguistic material recorded in natural communicative situations, allowing spoken Russian and everyday discourse to be studied in many aspects. However, the original sample of respondents was not sufficient for studying sociolinguistic variation in speech. Thus, it was decided to launch a large project aimed at the sociolinguistic extension of the ORD, which was supported by the Russian Science Foundation. The paper describes the general principles for the sociolinguistic extension of the corpus. It defines the social groups which should be represented in the corpus in adequate numbers, sets criteria for selecting participants, describes the “recorder’s kit” for the respondents, and covers the principles for adapting the ORD annotation and structure. The ORD collection now exceeds 1200 h of recordings, presenting the speech of 127 respondents and hundreds of their interlocutors. 2450 macro episodes of everyday spoken communication have already been annotated, and the speech transcripts add up to 1 million words.

Keywords: Speech corpus · Everyday spoken Russian · Oral communication · Sociolinguistics · Social groupings · Sociolects · Speech variation

1 Introduction

In sociolinguistic studies of the last decade one may observe the increasing use of corpora, and it is expected that variational linguistics “will increasingly interact with corpus-based approaches to linguistics from other areas” [1]. Some examples of sociolinguistic research performed on the basis of linguistic corpora are reviewed in [2, 3]. However, “texts found within most corpora do not contain the kind of material of greatest interest to most sociolinguists, namely, casual everyday speech, often from non-standard language varieties. Large corpora of spontaneously occurring spoken data are still expensive and time-consuming to compile due to problems of transcription and input” [3].

© Springer International Publishing Switzerland 2016
A. Ronzhin et al. (Eds.): SPECOM 2016, LNAI 9811, pp. 659–666, 2016.
DOI: 10.1007/978-3-319-43958-7_80

This is the author's version of the paper. The final publication is available at http://link.springer.com/chapter/10.1007/978-3-319-43958-7_80