Abstract for the conference Digital Humanities im deutschsprachigen Raum 2020

Confounding Variables in Sub-Genre Classification: Instructive Problems

Jannidis, Fotis (fotis.jannidis@uni-wuerzburg.de), Universität Würzburg
Konle, Leonard (leonard.konle@uni-wuerzburg.de), Universität Würzburg
Leinen, Peter (P.Leinen@dnb.de), Deutsche Nationalbibliothek

This paper started out as a report on the state of the art in text classification, but over time it became much more a reflection on the pitfalls of modeling genre through classification. Our research was initially motivated by developments in text classification: recent years have seen new approaches such as gradient boosting and deep neural networks, and our initial goal was to report on these approaches, which are still seldom used in the digital humanities. This, however, proved to be only the starting point for a deeper exploration of the genre structures in our collection of dime novels ('Heftromane', 'Groschenromane').

Most research on genre classification has looked at what one could call 'high-level classes' such as newspaper genres (news, editorials etc.; e.g. Frank and Bouckaert, 2006) or web genres (blog, personal website etc.; e.g. Eissen and Stein, 2004). From this perspective, all the texts we examine belong to one genre: the novel. The subgenres are types of love stories, such as the doctor novel ('Arztroman') or the country novel ('Heimatroman'), and types of adventure novels, mainly distinguished by their setting, such as the war novel ('Kriegsroman') or the science fiction novel. These novels are cheap ('dime novels'), are published in a booklet format, and are usually distributed via magazine kiosks rather than book shops (Stockinger 2018). From the very beginning it was clear to us that they do not constitute a random sample of each genre; the crime novels, for example, are just a small and very specific subsection of crime novels in general. Nevertheless, we assumed that genre is the main aspect by which these novels are grouped, for publishers and readers alike.

Our dataset consists of 11,600 dime novels from 12 different genres (see Fig. 1). The genre labels come from the four publishers who divide this market among themselves (Bastei, Martin Kelter, Pabel Moewig and Cora). The corpus has been documented in previous studies such as Jannidis et al. (2019a) and Jannidis et al. (2019b).

Figure 1: Novels per genre

We employed three groups of methods: traditional feature-based classifiers (Group A), modern feature-based classifiers (Group B) and deep learning (Group C). While Groups A and B take a document-term matrix as input (20,000 most frequent words, tf-idf weighted, stopwords removed, dimensionality reduced to 1,000 features with LSI), Group C works with unprocessed text. Named entities are removed completely. Hyperparameter optimization was done with Optuna (Akiba et al. 2019) by sampling from the space of values recommended by the documentation of the libraries and by Olson et al. (2017); Table 1 reports the best performance. We evaluated the deep learning approaches in advance on a smaller dataset, so that only the best architecture had to be tested extensively later (Table 1). To speed up training, the networks were initialized with fastText embeddings (Bojanowski et al. 2016) pre-trained on wikipedia.de and 30,000 novels. As a compromise between performance and speed, we used the BiRNN architecture for all subsequent experiments.
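To make the Group A/B preprocessing concrete, the following is a minimal sketch of such a pipeline, assuming scikit-learn and NLTK's German stopword list; the abstract fixes only the 20,000 most frequent words, tf-idf weighting and the LSI reduction to 1,000 dimensions, so the stopword list and the final classifier shown here are stand-ins.

    # Sketch of the Group A/B feature pipeline (assumed libraries: scikit-learn, NLTK).
    # Run nltk.download("stopwords") once before using the German stopword list.
    from nltk.corpus import stopwords
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression

    pipeline = Pipeline([
        # document-term matrix over the 20,000 most frequent words,
        # tf-idf weighted, stopwords removed
        ("tfidf", TfidfVectorizer(max_features=20000,
                                  stop_words=stopwords.words("german"))),
        # LSI: truncated SVD of the tf-idf matrix, reduced to 1,000 dimensions
        ("lsi", TruncatedSVD(n_components=1000)),
        # placeholder for any Group A or B classifier
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # texts: the novels with named entities removed; labels: the publishers'
    # genre labels (both assumed to be prepared beforehand)
    # pipeline.fit(texts, labels)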
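The hyperparameter search itself can be sketched as an Optuna study; the gradient-boosting search space below is illustrative and not necessarily the exact space sampled in the study.

    # Illustrative Optuna study for one feature-based classifier; the ranges
    # stand in for those recommended by the library documentation and
    # Olson et al. (2017).
    import optuna
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    def objective(trial):
        clf = GradientBoostingClassifier(
            n_estimators=trial.suggest_int("n_estimators", 100, 1000),
            learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            max_depth=trial.suggest_int("max_depth", 2, 8),
        )
        # X: the 1,000-dimensional LSI features, y: the genre labels (assumed inputs)
        return cross_val_score(clf, X, y, scoring="f1_macro", cv=5).mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)
    print(study.best_params)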
Table 1: Prestudy of deep learning architectures (4 subgenres, 800 novels)

                               Fasttext   Flair    CNN   CNN+BiRNN   BiRNN   HATTN
    f1-score                       .886    .931   .925        .935    .923    .926
    Time per epoch (seconds)         <5     288    210         190      90     215
    Time to converge (minutes)        3      48     28          25       6      21
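For illustration, a BiRNN classifier of this kind can be sketched in PyTorch as follows; the GRU cell, hidden size and mean pooling are assumptions, since the abstract specifies only a bidirectional RNN initialized with the pre-trained fastText embeddings.

    # Sketch of a BiRNN genre classifier, assuming PyTorch; architectural
    # details beyond "bidirectional RNN + pre-trained fastText embeddings"
    # are assumptions.
    import torch
    import torch.nn as nn

    class BiRNNClassifier(nn.Module):
        def __init__(self, fasttext_vectors, num_genres, hidden_size=128):
            super().__init__()
            # initialize the embedding layer with the pre-trained fastText
            # vectors (wikipedia.de + 30,000 novels) to speed up convergence
            self.embedding = nn.Embedding.from_pretrained(
                torch.tensor(fasttext_vectors, dtype=torch.float32), freeze=False)
            self.rnn = nn.GRU(self.embedding.embedding_dim, hidden_size,
                              batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden_size, num_genres)

        def forward(self, token_ids):
            embedded = self.embedding(token_ids)   # (batch, seq, embedding_dim)
            states, _ = self.rnn(embedded)         # (batch, seq, 2 * hidden_size)
            pooled = states.mean(dim=1)            # mean pooling over the sequence
            return self.out(pooled)                # one logit per genre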