A Comparative Evaluation of Term Recognition Algorithms

Ziqi Zhang, José Iria, Christopher Brewster and Fabio Ciravegna
Department of Computer Science, University of Sheffield, Sheffield, S1 4DP
Initial.LastName@dcs.shef.ac.uk

Abstract

Automatic Term Recognition (ATR) is a fundamental processing step preceding more complex tasks such as semantic search and ontology learning. Of the large number of methodologies available in the literature, only a few are able to handle both single- and multi-word terms. In this paper we present a comparison of five such algorithms and propose a combined approach using a voting mechanism. We evaluated the six approaches on two different corpora and show that the voting algorithm performs best on one corpus (a collection of texts from Wikipedia) but less well on the Genia corpus (a standard life-science corpus). This indicates that the choice and design of the corpus has a major impact on the evaluation of term recognition algorithms. Our experiments also showed that single-word terms can be equally important and account for a fairly large proportion of the terms in certain domains. As a result, algorithms that ignore single-word terms may cause problems for tasks built on top of ATR. Effective ATR systems also need to take into account both the unstructured text and the structured aspects of documents, which means that information extraction techniques need to be integrated into the term recognition process.

1. Introduction

Automatic Term Recognition (ATR) is an important research area that deals with the extraction of technical terms from domain-specific language corpora. ATR is often a processing step preceding more complex tasks, such as semantic search (Bhagdev et al. 2007) and especially ontology engineering (Park, Byrd, & Boguraev 2003; Brewster et al. 2007). There have been many studies into ATR.
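To make the voting-based combination mentioned in the abstract concrete, the sketch below shows one generic way several algorithms' ranked term lists could be merged by summing reciprocal ranks. This is an illustrative scheme under our own assumptions, not necessarily the exact formulation evaluated in this paper; the example term lists are invented.

```python
def vote(rankings):
    """Merge several ranked candidate-term lists into one ranking.

    Each algorithm contributes 1/position for every term it ranks,
    so terms ranked highly by several algorithms rise to the top.
    This reciprocal-rank scheme is a generic illustration only.
    """
    scores = {}
    for ranking in rankings:
        for pos, term in enumerate(ranking, start=1):
            scores[term] = scores.get(term, 0.0) + 1.0 / pos
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output of two term recognition algorithms
alg1 = ["gene expression", "cell", "protein"]
alg2 = ["cell", "gene expression", "enzyme"]
combined = vote([alg1, alg2])
```

Terms proposed by only one algorithm still enter the combined list, but with lower scores than terms on which the algorithms agree.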
In the majority of these studies (Ananiadou 1994; Bourigault 1992; Fahmi, Bouma, & van der Plas 2007; Frantzi & Ananiadou 1999; Wermter & Hahn 2005), linguistic processors (e.g. a POS tagger or phrase chunker) are used to filter out stop words and restrict candidate terms to nouns or noun phrases, while in others any n-gram sequence may be selected as a candidate term (Deane 2005). Statistical measures are then used to rank the candidate terms. These measures can be categorised into two kinds: measures of ‘unithood’, indicating the collocation strength of the units that comprise a single term; and measures of ‘termhood’, indicating the association strength of a term to domain concepts. For measuring ‘unithood’, measures such as mutual information (Daille 1996), log likelihood (Cohen 1995), the t-test (Fahmi, Bouma, & van der Plas 2007; Wermter & Hahn 2005), and the notion of ‘modifiability’ and its variants (Caraballo & Charniak 1999; Deane 2005; Wermter & Hahn 2005) are employed. In contrast, measures of ‘termhood’ are circumscribed to frequency-based approaches and the use of reference corpora: the classic TFIDF used in (Evans & Lefferts 1995; Medelyan & Witten 2006); the notion of ‘weirdness’ introduced in (Ahmad, Gillam, & Tostevin 1999), which compares a term’s frequency in the corpus with its frequency in a reference corpus from a different domain; and measures such as ‘domain pertinence’ (Sclano & Velardi 2007) and ‘domain specificity’ (Kozakov et al. 2004; Park, Byrd, & Boguraev 2002), which extend and revise ‘weirdness’. The trend in recent research is to use hybrid approaches, in which ‘unithood’ and ‘termhood’ are combined to produce a unified indicator, such as ‘C-value’ (Frantzi & Ananiadou 1999) and many others (Fahmi, Bouma, & van der Plas 2007; Kozakov et al. 2004; Park, Byrd, & Boguraev 2002; Sclano & Velardi 2007). Despite the plethora of methods available, seldom is the full range of the problem dealt with by any one method.
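As a concrete illustration of a ‘termhood’ measure, ‘weirdness’ in the sense of (Ahmad, Gillam, & Tostevin 1999) can be sketched as the ratio of a term’s relative frequency in the domain corpus to its relative frequency in a general reference corpus. The minimal implementation below is our own sketch, with add-one smoothing added as an assumption to avoid division by zero for terms absent from the reference corpus; the toy corpora are invented.

```python
from collections import Counter

def weirdness(term, domain_tokens, reference_tokens):
    """Ratio of a term's relative frequency in a domain corpus to
    its relative frequency in a general reference corpus.
    Higher values suggest stronger domain specificity.
    Add-one smoothing on the reference side is an assumption here,
    not part of the original formulation."""
    domain_rel = Counter(domain_tokens)[term] / len(domain_tokens)
    reference_rel = (Counter(reference_tokens)[term] + 1) / (len(reference_tokens) + 1)
    return domain_rel / reference_rel

# Toy corpora for illustration only
domain = "the gene expression of the target gene was measured".split()
general = "the cat sat on the mat and the dog barked".split()
```

On these toy corpora, a domain word such as "gene" scores well above 1, while a common function word such as "the" scores below 1, which is the intended behaviour of a termhood measure.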
Firstly, most works rely on the simplifying assumption that the majority of terms consist of multi-word units (Caraballo & Charniak 1999; Deane 2005; Fahmi, Bouma, & van der Plas 2007; Wermter & Hahn 2005). However, while (Nakagawa & Mori 1998) claims that 85% of domain-specific terms are multi-word units, (Krauthammer & Nenadic 2004) claims that only a small percentage of gene names are. Hence, for some domains such an assumption leads to very low recall, which, in turn, can hamper tasks built on top of ATR. Secondly, some approaches (Deane 2005; Frantzi & Ananiadou 1999; Wermter & Hahn 2005) apply frequency thresholds to reduce the algorithm’s search space by filtering out low-frequency candidate terms. This, however, does not take into account Zipf’s law (word frequencies follow highly skewed distributions, with a large number of rare events), again leading to reduced recall. Finally, experimental evaluations throughout the literature are