Automatic Thesaurus Generation using Co-occurrence

Rogier Brussee, Christian Wartena
Telematica Instituut, P.O. Box 589, 7500 AN Enschede, The Netherlands

Abstract

This paper proposes a characterization of useful thesaurus terms by the informativity of co-occurrence with that term. Given a corpus of documents, informativity is formalized as the information gain of the weighted average term distribution of all documents containing that term. While the resulting algorithm for thesaurus generation is unsupervised, we find that high informativity terms correspond to large and coherent subsets of documents. We evaluate our method on a set of Dutch Wikipedia articles by comparing high informativity terms with keywords for the Wikipedia category of the articles.

1 Introduction

We consider the problem of generating a thesaurus for a given collection using statistical methods. This problem is related to, but different from, that of assigning keywords to a text from a list of keywords, and that of finding the most characteristic terms for a given subset of the corpus. Our approach is to produce a list of terms that are the most informative for understanding the collection as a whole. Part of the attraction of the present approach is that it proposes a statistical model that formalizes the notion of a (good) keyword. If we accept the assumptions leading to this model, the high-level algorithms for generating a thesaurus are almost forced upon us. Our main assumption is that co-occurrence of terms is a proxy for their meaning [11, 14, 9]. To use this information, we compute for each term the distribution of all co-occurring terms. We can then use this co-occurring term distribution as a proxy for the meaning of the term in the context of the collection and compare it with the term distribution of a single document. We assume that a document is semantically related to a term if the term distribution of the document is similar to the term's distribution of co-occurring terms.
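The co-occurring term distribution described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; in particular, weighting the average by document length (i.e. pooling the tokens of all documents containing the term) is our own simplifying assumption, and the tokenized-document representation is hypothetical.

```python
from collections import Counter

def term_distribution(doc_tokens):
    """Normalized term frequencies of a single document."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def cooccurrence_distribution(term, docs):
    """Distribution of terms co-occurring with `term`: the average of the
    term distributions of all documents containing `term`, here weighted
    by document length, which amounts to pooling their tokens."""
    pooled = Counter()
    for doc in docs:
        if term in doc:
            pooled.update(doc)
    total = sum(pooled.values())
    return {t: c / total for t, c in pooled.items()}
```

The resulting distribution can then be compared against the term distribution of any single document to judge semantic relatedness.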
Fortunately, there is a natural similarity measure for probability distributions: the relative entropy, or Kullback-Leibler divergence. If we follow this formalization through, there is an obvious strategy for generating a thesaurus: take the set of terms that give the greatest overall information gain, defined as a difference of Kullback-Leibler divergences. In practice this model is a slight oversimplification, e.g. because the same subject can be characterized by different terms. We discuss this in section 5.

The organization of this paper is as follows. In section 2 we discuss related work. In section 3 we introduce the different probability distributions and the information theoretic notion of Kullback-Leibler divergence that form the basis for the rest of the paper. In section 4 we use these to give various definitions of information gain that can be used to rank keywords. In section 5 we evaluate our notions on a corpus of Dutch Wikipedia articles.
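As a concrete illustration of the similarity measure used throughout, the Kullback-Leibler divergence between two term distributions (represented as dicts mapping terms to probabilities) can be computed as below. The epsilon smoothing for terms absent from the second distribution is a simplistic choice of our own, not the paper's; sections 3 and 4 develop the actual definitions.

```python
import math

def kl_divergence(p, q, epsilon=1e-10):
    """Relative entropy D(p || q) in nats. Terms with zero probability
    in q receive a small mass `epsilon` so the sum stays finite
    (an illustrative smoothing choice, not the method of the paper)."""
    return sum(pt * math.log(pt / q.get(t, epsilon))
               for t, pt in p.items() if pt > 0)
```

Note that D(p || q) is asymmetric and equals zero exactly when the two distributions agree on the support of p, which is why it serves as a measure of information gained when q is replaced by p.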