Textual Characteristics for Language Engineering

Mathias Bank, Robert Remus, Martin Schierle

Pattern Science AG, 63579 Freigericht, Germany
Natural Language Processing Group, University of Leipzig, Germany
Mercedes-Benz RD North America, Palo Alto, USA

m.bank@cid.biz, rremus@informatik.uni-leipzig.de, martin.schierle@daimler.com

Abstract

Language statistics are widely used to characterize and better understand language. In parallel, the number of text mining and information retrieval methods has grown rapidly over the last decades, with many algorithms evaluated on standardized corpora, often drawn from newspapers. However, up to now there have been almost no attempts to link the areas of natural language processing and language statistics in order to properly characterize those evaluation corpora, and to help others pick the most appropriate algorithms for their particular corpus. We believe no results in the field of natural language processing should be published without quantitatively describing the corpora used. Only then can the real value of proposed methods be determined and their transferability to corpora originating from different genres or domains be estimated. We lay the groundwork for a language engineering process by gathering and defining a set of textual characteristics we consider valuable with respect to building natural language processing systems. We carry out a case study on the analysis of automotive repair orders and explicitly call upon the scientific community to provide feedback and to help establish a good practice of corpus-aware evaluations.

Keywords: Textual Characteristics, Language Engineering, Language Statistics

1. Motivation

Language statistics and quantitative linguistics are widely used to study, characterize and better understand language, to help foreign learners or even to identify authors (Holmes, 1994). Těšitelová (1992) provides a comprehensive overview of the large pool of methods available today.
Implicitly connected, natural language processing (NLP) methods often rely on statistical methods and machine learning algorithms, which in turn massively rely on certain textual characteristics, e.g. token frequencies, token distributions and token probability transitions. Still, textual characteristics of corpora used for training and testing such methods and algorithms are rarely analyzed and documented. We strongly believe the successful creation of real-world NLP systems, i.e. the selection of appropriate methods and algorithms, is only possible if the respective text types are soundly understood. Furthermore, we believe scientific publications in NLP must clearly document language statistics of the corpora used. This is necessary because not all algorithms work equally well on every text type, and their portability may be questionable (Sekine, 1997; Escudero et al., 2000; Wang and Liu, 2011). Only by knowing the textual characteristics of a certain text type is it possible to estimate the transferability of proposed methods and hence assess their real value. To the best of our knowledge there is no previous work that uses language statistics to give guidance in building NLP systems, although this is a crucial part of every language engineering (Cunningham, 1999) process.

In the next section, we select and present suitable language statistics. In Section 3. we apply them to English-language corpora from three different genres: news articles, web forum posts and automotive repair orders. In Section 4. we carry out a case study and demonstrate how textual characteristics may give guidance in selecting appropriate algorithms for a successful genre-specific information extraction system. Finally, we draw conclusions in Section 5.

2. A Language Engineering Fingerprint

Although there is a broad range of language statistics available, we only use a carefully handpicked set.
We believe this set should be limited to support direct comparisons within one representative chart: a language engineering fingerprint. Furthermore, we only use language statistics that can be easily and quickly calculated without the need for advanced language processing modules, e.g. part-of-speech (POS) taggers or syntax parsers. Such modules are usually highly text type-dependent (Sekine, 1997) and hence cannot be directly applied to previously unknown text types, as the selection of the most appropriate modules is precisely the goal of the analysis.

1. Shannon's entropy H measures the average amount of information in an underlying data structure. Applied in the field of language engineering, the mean amount of information of a token t_i can be calculated by approximating its probability p(t_i) via its frequency in a given corpus. The entropy as given in Formula 1 is normalized to the vocabulary size |V|, i.e. the number of types in the corpus:

   H = - Σ_{t_i ∈ V} p(t_i) log_{|V|} p(t_i)    (1)

   A high entropy indicates that many words occur with small frequencies, instead of few words that occur with large frequencies.

2. The relative vocabulary size R_Voc (Těšitelová, 1992, chapter 1.2.3.3) is given by the ratio of the vocabulary size |V| and the total number of tokens N_m with respect to "meaningful" words. These are defined as words that are not function words (N_m = {t | t/
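The normalized entropy of Formula 1 is straightforward to compute from raw token counts. The following Python sketch is our own illustration, not the authors' implementation; the function name and the assumption that the input is an already tokenized list are ours. Using log base |V| bounds H to [0, 1], with 1 reached when all types are equally frequent:

```python
import math
from collections import Counter

def normalized_entropy(tokens):
    """Shannon entropy of the token distribution, normalized to the
    vocabulary size |V| by taking logarithms to base |V| (Formula 1)."""
    counts = Counter(tokens)
    n = len(tokens)          # total number of tokens N
    v = len(counts)          # vocabulary size |V| (number of types)
    if v < 2:
        return 0.0           # log base |V| is undefined for |V| <= 1
    return -sum((c / n) * math.log(c / n, v) for c in counts.values())

# A uniform token distribution yields the maximal entropy of 1.0:
print(normalized_entropy(["a", "b", "c", "d"]))  # → 1.0
```

A skewed distribution, e.g. `["a", "a", "a", "b"]`, yields a value below 1.0, matching the paper's reading that high entropy signals many low-frequency words rather than a few dominant ones.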