Automatic Thesaurus Construction

Dongqiang Yang | David M. Powers
School of Informatics and Engineering
Flinders University of South Australia
PO Box 2100, Adelaide 5001, South Australia
Dongqiang.Yang|David.Powers@flinders.edu.au

Abstract

In this paper we introduce a novel method of automating thesaurus construction using syntactically constrained distributional similarity. With respect to syntactically conditioned co-occurrences, most popular approaches to automatic thesaurus construction simply ignore the salience of grammatical relations and effectively merge them into one unified ‘context’. We distinguish the semantic differences of each syntactic dependency and propose to generate thesauri through word overlap across the major types of grammatical relations. The encouraging results show that our proposal can build automatic thesauri with significantly higher precision than the traditional methods.

Keywords: syntactic dependency, distribution, similarity.

1 Introduction

The usual way of constructing a thesaurus automatically is to calculate and rank the distributional similarity between each seed word and all other words occurring in the corpora, and then extract the top n words in the seed word’s similarity list as its thesaurus entries. The attraction of automatically constructing or extending lexical resources clearly rests on its time efficiency and effectiveness, in contrast to the time-consuming and quickly outdated publication of manually compiled lexicons. Its applications mainly include constructing domain-oriented thesauri for automatic keyword indexing and document classification in Information Retrieval, Question Answering, Word Sense Disambiguation, and Word Sense Induction.

As the foundation of automatic thesaurus construction, distributional similarity is often calculated in a high-dimensional vector space model (VSM). With respect to the basic elements of the VSM (Lowe, 2001), the dimensionality of word space can be syntactically conditioned (i.e.
grammatical relations) or unconditioned (i.e. ‘a bag of words’). Under these two context settings, different similarity methods have been widely surveyed, for example for ‘a bag of words’ (Sahlgren, 2006) and for grammatical relations (Curran, 2003; Weeds, 2003). Moreover, the framework of Padó and Lapata (2007) compared the two settings and observed that the syntactically constrained VSM outperformed the unconditioned one, which exclusively counts word co-occurrences within a ±n window.

[Copyright (c) 2008, Australian Computer Society, Inc. This paper appeared at the Thirty-First Australasian Computer Science Conference (ACSC2008), Wollongong, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 74. Gillian Dobbie and Bernard Mans, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.]

Given the hypothesis that similar words share similar grammatical relationships and semantic content, the basic procedure for estimating such distributional similarity consists of (1) pre-processing sentences in the corpora with shallow or complete parsing; (2) extracting syntactic dependencies into distinct subsets or vector spaces (Xs) according to head-modifier relations, including adjective-noun (AN) and adverb or the nominal head of a prepositional phrase to verb (RV), and grammatical roles, including subject-verb (SV) and verb-object (VO); and (3) determining distributional similarity using similarity measures such as the Jaccard coefficient and the cosine, or probabilistic measures such as KL divergence and information radius. On the other hand, without the premise that grammatical relations regulate semantic content, calculating distributional similarity can simply work on word co-occurrences.
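To make steps (2) and (3) of this procedure concrete, the following minimal sketch groups dependency triples into one vector space per grammatical relation and then compares two words within a relation using the cosine and the (binary) Jaccard coefficient. The triples, words, and counts below are illustrative assumptions, not data from the corpora discussed here, and parsing (step 1) is assumed to have already produced the triples.

```python
# Sketch of per-relation vector spaces built from (head, relation, dependent)
# triples, with cosine and binary Jaccard similarity computed within one relation.
from collections import defaultdict
import math

# Toy parser output: (head, relation, dependent) triples (illustrative only).
triples = [
    ("drink", "VO", "water"), ("drink", "VO", "juice"),
    ("sip", "VO", "water"), ("sip", "VO", "tea"),
    ("drink", "SV", "man"), ("sip", "SV", "man"),
]

def build_spaces(triples):
    """Map each relation type to {word: {context_word: count}} (step 2)."""
    spaces = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for head, rel, dep in triples:
        spaces[rel][head][dep] += 1
    return spaces

def cosine(u, v):
    """Cosine of two sparse count vectors represented as dicts (step 3)."""
    shared = set(u) & set(v)
    num = sum(u[w] * v[w] for w in shared)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def jaccard(u, v):
    """Binary Jaccard coefficient over the context words each vector touches."""
    a, b = set(u), set(v)
    return len(a & b) / len(a | b) if a | b else 0.0

spaces = build_spaces(triples)
# "drink" and "sip" overlap only on "water" in the VO space.
print(round(cosine(spaces["VO"]["drink"], spaces["VO"]["sip"]), 3))   # → 0.5
print(round(jaccard(spaces["VO"]["drink"], spaces["VO"]["sip"]), 3))  # → 0.333
```

Keeping each relation (VO, SV, AN, RV) in its own space is what allows similarity to be computed per dependency type rather than over one merged context.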
Instead of arguing the pros and cons of these two context representations in specific applications, we focus on how to effectively and efficiently produce automatic thesauri from syntactically conditioned co-occurrences.

Without distinguishing the latent differences among grammatical relations in determining word meanings in context, most approaches have simply chained or clumped these syntactic dependencies into one unified context representation for computing distributional similarity, for example in automatic thesaurus construction (Hirschman et al., 1975; Hindle, 1990; Grefenstette, 1992; Lin, 1998; Curran, 2003), as well as in Word Sense Disambiguation (Yarowsky, 1993; Lin, 1997; Resnik, 1997), word sense induction (Pantel and Lin, 2002), and finding the predominant sense (McCarthy et al., 2004). These approaches improved the distributional representation of a word through a fine-grained context that filters out the unrelated or unnecessary words admitted by the traditional ‘bag of words’ or unordered context, provided that the parsing errors introduced are acceptable or negligible. Based on observed events, these approaches typically scaled each grammatical relation by its frequency statistics when computing distributional similarity, for example in the weighted (Grefenstette, 1992) or mutual-information-based (Lin, 1998) Jaccard coefficient.
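As a rough illustration of such frequency-based scaling, the sketch below weights each (relation, context-word) feature by positive pointwise mutual information and compares two words in the spirit of Lin's (1998) measure, i.e. the shared feature weight divided by the total feature weight. The triples and the exact normalisation are assumptions for illustration only, not the precise formulations used in the cited work.

```python
# Sketch of mutual-information weighting over dependency features, with a
# Lin-style similarity: shared MI weight over total MI weight (toy data).
from collections import Counter
import math

triples = [
    ("drink", "VO", "water"), ("drink", "VO", "water"), ("drink", "VO", "juice"),
    ("sip", "VO", "water"), ("sip", "VO", "tea"),
    ("eat", "VO", "bread"),
]

pair = Counter((w, (r, c)) for w, r, c in triples)  # f(word, feature)
word = Counter(w for w, _, _ in triples)            # f(word)
feat = Counter((r, c) for _, r, c in triples)       # f(feature)
N = len(triples)

def mi(w, f):
    """Positive pointwise mutual information between a word and a feature."""
    if pair[(w, f)] == 0:
        return 0.0
    return max(0.0, math.log(pair[(w, f)] * N / (word[w] * feat[f])))

def features(w):
    """All (relation, context-word) features observed with word w."""
    return {f for (x, f) in pair if x == w}

def lin_sim(w1, w2):
    """Shared MI weight over total MI weight, in the spirit of Lin (1998)."""
    shared = features(w1) & features(w2)
    num = sum(mi(w1, f) + mi(w2, f) for f in shared)
    den = sum(mi(w1, f) for f in features(w1)) + \
          sum(mi(w2, f) for f in features(w2))
    return num / den if den else 0.0

print(round(lin_sim("drink", "sip"), 3))
```

Ranking all candidate words by such a score against a seed word, and keeping the top n, yields exactly the thesaurus entries described at the start of this section.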