Multilingualism in Ancient Texts: Language Detection by Example of Old High German and Old Saxon

Zahurul Islam 1, Roland Mittmann 2, Alexander Mehler 1
1 AG Texttechnology, Institut für Informatik, Goethe-Universität Frankfurt
2 Institut für Empirische Sprachwissenschaft, Goethe-Universität Frankfurt
E-mail: zahurul, mittmann, mehler@em.uni-frankfurt.de

Abstract

In this paper, we present an approach to language detection in streams of multilingual ancient texts. We introduce a supervised classifier that detects, amongst others, Old High German (OHG) and Old Saxon (OS). We evaluate our model by means of three experiments which show that language detection is possible even for dead languages. Finally, we present an experiment in unsupervised language detection as a tertium comparationis for our supervised classifier.

Keywords: Language identification, Ancient text, n-gram, classification, clustering

1. Introduction

With the rise of the web, we face more and more on-line resources that mix different languages. This multilingualism of textual resources poses a challenge for many tasks in Natural Language Processing (NLP). As a consequence, Language Identification (LI) is now an indispensable preprocessing step for many NLP applications, including machine translation, automatic speech recognition, text-to-speech systems as well as text classification in multilingual scenarios.

Obviously, LI is a well-established field of application of NLP. However, for documents written in low-density languages, or documents that mix several dead languages, adequate models of language detection are rarely found. At the same time, ancient languages are becoming more and more central to approaches in computational humanities, historical semantics and studies of language evolution. Thus, we are in need of models of language detection for dead languages. In this paper, we present such a model.
We introduce a supervised classifier that detects, amongst others, OHG and OS. To do so, we extend the model of (Waltinger and Mehler, 2009) so that it also accounts for dead languages. For any segment of the logical document structure of a text, our task is to detect the language in which that segment was written. Detection at the segment level rather than at the level of whole texts allows us to make explicit the multilingualism of ancient documents, starting from the level of words via the level of sentences up to the level of texts. As a result, language-specific preprocessing tools can be applied in such a way that they focus on those segments that provide relevant input for them. In this way, our approach is a first step towards building a preprocessor for multilingual ancient texts.

The paper is organized as follows: Section 2 discusses related work. Section 3 describes the corpus of texts that we have used for our experiments. Section 4 briefly introduces our approach to supervised language detection, which is evaluated in Section 5. Section 6 describes an unsupervised language classifier. Finally, a conclusion is given in Section 7.

2. Related Work

As we present a model of n-gram-based language detection, we briefly discuss work in this area.

(Cavnar and Trenkle, 1994) describe a system of n-gram-based text and language categorization. Basically, they calculate n-gram profiles for each target category. Categorization is performed by measuring the distance between the profile of an input document and the profiles of the target categories. Regarding language classification, the accuracy of this system is 99.8%.

The same technique has been applied by (Mansur et al., 2006) for text categorization. In this approach, a corpus of newspaper articles has been used as input to categorization. (Mansur et al., 2006) show that n-grams of length 2 and 3 are most efficiently used as features for text
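The profile-based scheme of (Cavnar and Trenkle, 1994) can be sketched in a few lines: rank the most frequent character n-grams per language, then classify a segment by the language whose ranked profile is closest under the "out-of-place" measure. The snippet below is a minimal illustration, not the authors' implementation or the classifier presented in this paper; the toy training snippets and parameter values (n-grams up to length 3, top 300 profile entries) are assumptions chosen for brevity.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Ranked character n-gram profile in Cavnar-Trenkle style.

    Tokens are padded with '_' so word-initial and word-final
    n-grams are distinguished from word-internal ones.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, cat_profile):
    """Sum of rank displacements; n-grams absent from the
    category profile incur a fixed maximum penalty."""
    max_penalty = len(cat_profile)
    distance = 0
    for gram, rank in doc_profile.items():
        if gram in cat_profile:
            distance += abs(rank - cat_profile[gram])
        else:
            distance += max_penalty
    return distance

def detect(segment, profiles):
    """Return the language whose profile is closest to the segment's."""
    doc_profile = ngram_profile(segment)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))

# Toy training snippets (illustrative only, not the corpus used in the paper):
training = {
    "en": "the quick brown fox jumps over the lazy dog and runs away",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft weg",
}
profiles = {lang: ngram_profile(text) for lang, text in training.items()}
print(detect("the dog runs over the fox", profiles))  # prints "en"
```

Because profiles are computed per segment rather than per document, the same mechanism extends naturally to the segment-level detection pursued in this paper, given training material for each historical language stage.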