Digital Humanities 2023

Digital Edition of Complete Tolstoy's Heritage: OCR Crowd Sourcing Initiative, Literary Scholarship and User Scenarios

Bonch-Osmolovskaya, Anastasia, abonch@gmail.com, DH CLOUD; TOLSTOY DIGITAL
Orekhov, Boris, nevmenandr@gmail.com, DH CLOUD; Institute of Russian Literature (Pushkin House), Russia
Tolstaya, Fekla, 6975991@gmail.com, TOLSTOY DIGITAL

The publication of the 90-volume complete edition of Tolstoy's works took thirty years (1928-1958). Despite the great effort put into the collection, the print run was small, making it a bibliographic rarity today. The edition contains more than 7.8 million words, or 44,350 pages of extraordinarily diverse text, and consists of three parts: Fiction and Essays, including previously unpublished versions and drafts (volumes 1-45), Tolstoy's Diaries (volumes 46-58), and Tolstoy's Correspondence (volumes 59-90).

In 2014 the Tolstoy Museum in Moscow and the IT company ABBYY, a leader in OCR technology, launched a unique crowdsourcing project, "All Tolstoy in one click". The 90-volume edition was digitized using ABBYY's OCR technology and then proofread by thousands of volunteers from forty-nine countries within two weeks.

The XML files that emerged from the crowdsourcing project gave birth to the idea of developing a fully-fledged digital edition of Tolstoy's heritage. The crucial conceptual decision we had to make at the outset was whether to stick with the 90-volume publication as the material source and create a digital version of the book, or to create a new digital edition of Tolstoy's heritage (Bonch-Osmolovskaya, Skorinkin et al. 2019). In this respect, the digital output can pursue three goals, each of which has a direct influence on the final product:

1. Preservation of Tolstoy's heritage, freed from the editorial construct of the 90-volume complete edition and open to further expansion by other sources.
2. Preservation of the enormous heritage of literary scholarship contained in the various critical apparatuses of the 90-volume edition.
3.
Accessibility of Tolstoy's heritage to the digital user, which ultimately means other scenarios of interaction with texts and their interpretations.

These same three goals were set by the creators of the "91st volume" application, which implements in electronic form one part of the complete works of Tolstoy, the index to the edition (Orekhov et al. 2018, Orekhov 2020). The three goals are in fact in conflict: choosing only one of them would make it more difficult to fulfill the others. The first goal can be achieved, for example, by extracting only Tolstoy's text from the recognized OCR files. The second goal can best be achieved by creating a digital diplomatic edition (Pierazzo 2011). To achieve the third goal, one could, for example, skip the time-consuming phase of TEI preparation.

We have worked out a compromise that strikes a balance between the three goals and so far protects each of them through a series of specific conceptual choices.

1. We focus on Tolstoy's digital heritage, with the 90-volume edition being the primary, but not the only, source. This means that we work with texts and not with volumes. We extract all of Tolstoy's text and provide each document with detailed metadata. We introduce the concept of a "family" of works, an abstract unit that groups together all variants of a text (Robinson 2013) along with commentaries and other related works. The titles of all members of the family are converted into a machine-readable format, to distinguish Tolstoy's original titles from those assigned by the editors of a volume. Each work in a family has a status tag that is used to create a hierarchy between the main work and the different types of its variants. Finally, families themselves can be linked if there is an important connection between them. The screenshot below shows an example.
2. We carefully include all critical apparatus presented in the 90-volume edition.
The printing practices of the mid-20th century imply certain reader scenarios that should be transformed into digital scenarios. For example, the index, which occupies a separate, additional 91st volume, needs to be transformed into a database linked to text spans rather than to the pages of a volume (Iglesia, Göbel 2015). The value of the index should not be underestimated: it cannot be replaced by a NER algorithm. The index reflects years of thorough editorial work aimed at clarifying indirect mentions of people (Orekhov 2020). The 91st volume had not been proofread by volunteers and, moreover, was full of OCR errors. So first we parsed and corrected the index, converted it to a database, assigned an identifier to each person entry and linked the entries to Wikidata where possible. Second, we used spaCy NER to extract all the persons mentioned in the documents and then automatically picked out the best candidates from the database. We ranked the candidates by similarity weight and then manually checked those whose weight was below 100%.
3. We have developed user scenarios to navigate the vast Tolstoy heritage. For example, we have simplified the genre sys-