International Scientific and Practical Conference "Electronics and Information Technologies" (ELIT-2018) A-73

Influence of Unique Words on the Performance of Corpus-Based Keyword Detection Methods

O. S. Kushnir, V. V. Yaremkiv, I. Y. Dovhan, and A. I. Kashuba
Department of Optoelectronics and Information Technologies
Ivan Franko National University of Lviv
107 Tarnavsky Street, 79017 Lviv, Ukraine
o.s.kushnir@lnu.edu.ua, volodymyr.yaremkiv@lnu.edu.ua

Abstract: We study the performance of corpus-based keyword detection methods, including TF-IDF, in the particular case when a text under investigation contains unique words, i.e. words that are absent or rare in the other texts of the corpus. Two points are the subject of our main attention: the quality of the keyword list together with the appropriateness of the corresponding keyness scores, and the criticality of the methods with respect to small perturbations of the corpus. We conclude that a number of heuristically introduced TF-IDF-like measures compete quite successfully with TF-IDF in performance but, on the other hand, cannot cope with the criticality of the scores that is inherent to unique words.

Index Terms: Keywords; corpus-based keyword detection methods; TF-IDF; unique words; criticality

I. INTRODUCTION

Since keywords concisely summarize the main semantics and contents of a text, their automated extraction is a useful tool for indexing and categorizing textual documents and, more generally, for textual data mining and information retrieval. Roughly, keyword detection methods can be divided into ‘domain-dependent’ and ‘domain-independent’ groups, according to whether or not they involve a reference textual database (a collection of texts, or corpus).
The term ‘domain-dependent’ implies that the corpus can be attributed to some domain or topic, so that the keywords extracted from a text under analysis reflect the meanings that distinguish this text against the background of the domain described by the corpus. For instance, the word ‘physical’ is hardly a keyword with respect to a corpus devoted to pure physics, although it may well be one with respect to a more general collection of texts, e.g. on natural or social sciences. In spite of this inconvenience, and of the evident drawback of a relatively low operation speed, corpus-based keyword detection methods such as the well-known TF-IDF (Term Frequency – Inverse Document Frequency) approach [1, 2] play a central role in modern web search engines. Moreover, in most cases they outperform standard domain-independent detection techniques, which rely upon a single text under test and engage no corpora (see [3–7]). As a consequence, the results derived with the corpus-based methods can be used as an authoritative reference or benchmark when comparing various (higher-speed) domain-independent methods and judging which of them is better. In this respect, the data of corpus-based methods can be regarded as a useful alternative to the commonly used human-made keyword lists, which, of course, might be subjective.

Despite a large amount of empirical and theoretical work on the domain-dependent methods for detecting keywords, we believe that the subject is still not exhausted. In particular, this concerns the point of our present attention, the so-called ‘unique words’. We define them as the words present in a given text (to be compared with a corpus) but absent in all the texts of the corpus. Put another way, the number $n_t$ of texts from the corpus in which such a ‘truly unique’ word $t$ occurs is $n_t = 0$. We have found that unique words represent a rather general phenomenon, being typical for a large majority of texts.
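The definition above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the toy corpus, the sample text, and the function names are hypothetical, and the IDF used is the common textbook form log(n / n_t), which is not necessarily the exact variant employed in this work.

```python
import math

def document_frequency(word, corpus):
    """Return n_t: the number of corpus texts that contain `word`."""
    return sum(1 for text in corpus if word in text)

def idf(n_t, n):
    """Textbook IDF, log(n / n_t) (assumed form); None when n_t = 0."""
    if n_t == 0:
        return None  # log(n / 0) is undefined for a truly unique word
    return math.log(n / n_t)

def truly_unique_words(text, corpus):
    """Words of `text` that occur in no text of the corpus (n_t = 0)."""
    return sorted(w for w in set(text) if document_frequency(w, corpus) == 0)

# Hypothetical toy corpus: each text is reduced to its set of word types.
corpus = [
    {"the", "ring", "was", "lost"},
    {"the", "king", "was", "old"},
    {"the", "road", "goes", "on"},
]
# Hypothetical text under analysis, as a list of word tokens.
text = ["the", "frabjous", "ring", "was", "found"]

n = len(corpus)
for word in sorted(set(text)):
    n_t = document_frequency(word, corpus)
    print(word, n_t, idf(n_t, n))

print(truly_unique_words(text, corpus))
```

In this toy setting, adding a single text that contains ‘frabjous’ to the corpus would change its $n_t$ from 0 to 1, so that an undefined score turns into the largest finite value of this IDF, log n. This is the kind of small corpus perturbation the discussion refers to.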
Neologisms, words invented on purpose, uncommon and rarely used scientific or technical terms, and even typos are ready examples of unique words. It is also useful to extend the discussion to the case of so-called ‘quasi-unique’ words, which occur very rarely in a corpus ($n_t \ll n$ though $n_t > 0$, with $n$ being the overall number of texts in the corpus). As with the ‘truly unique’ words, the ranking of ‘quasi-unique’ words yielded by the domain-dependent methods can suffer from criticality. For convenience, our term ‘unique word’ embraces both classes, the ‘truly unique’ and the ‘quasi-unique’ words.

Although these criticality problems are intuitively well understood by the wide information-retrieval community, the relevant analysis has been chiefly reduced to rather schematic or purely qualitative arguments. To the best of our knowledge, the problem has still not been addressed in a direct quantitative manner. In the present work we study and compare the performance characteristics of a number of corpus-based keyword detection methods under the condition that unique words are present in the text.

II. MATERIALS AND METHODS

A. Corpus and Texts under Analysis

We have prepared a corpus of literary works taken from the free text collection “Project Gutenberg” [8]. There are $n = 4829$ texts in our corpus, and its size amounts to 1.88 GB in UTF-8 encoding. The total length $L$ of all the texts, in units of word tokens, is approximately $L = 3.81 \times 10^{8}$, while the total vocabulary $V$, in units of word types, is $V = 1.23 \times 10^{6}$. The average text length in this corpus is then $l_m = L/n \approx 0.79 \times 10^{5}$. The main text we have analyzed is J. R. R. Tolkien’s novel