One-sided Sampling for Learning Taxonomic Relations in the Modern Greek Economic Domain

Katia Kermanidis and Nikos Fakotakis
Artificial Intelligence Group, Department of Electrical and Computer Engineering
University of Patras, 26500 Rio, Greece
{kerman, fakotaki}@wcl.ee.upatras.gr

Abstract

This paper describes the process of learning taxonomic relations automatically from Modern Greek economic corpora. Supervised learning (Decision trees, Support Vector Machines, Meta-learning) is applied to economic term pairs; each pair is represented through a set of statistical, semantic and syntactic features. The resulting set of feature-value vectors presents a high imbalance in the class distribution, due to the large number of term pairs that do not present a direct semantic relation. This problem is addressed using One-sided Sampling, which reduces the number of majority class instances by removing examples that are noisy, misleading or redundant. The approach uses no external resources (merely an economic corpus annotated with elementary morphological and phrase chunking information) and only limited language-dependent elements, to facilitate its portability to other languages and domains. An overall f-measure of 71% is achieved.

1. Introduction

A domain ontology is a tool that enables information retrieval, data mining and intelligent search; it consists of concepts that are important for communicating domain knowledge. These concepts are structured hierarchically through taxonomic relations. A taxonomy usually includes hyperonymy-hyponymy (is-a) and meronymy (part-of) relations. Learning taxonomic relations between the concepts that describe a specific domain automatically from corpora is a key step towards ontology engineering. The advent of the semantic web has pushed the construction of concept taxonomies to the top of the list of interests of language processing experts.
A complete ontology, however, may also include further information regarding each concept. The economic domain, especially, is governed by more ‘abstract’ relations that capture concept attributes (e.g. rise and drop are two attributes of the concept value; a stockholder is an attribute of the concept company). Henceforth, this type of relation will be referred to as an attribute relation.

The present work proposes a methodology for automatically detecting taxonomic relations between the terms that have been extracted from Modern Greek collections of economic texts. Unlike most previous work, which focuses mainly on hyponymy, this work also detects meronymy and attribute relations. A term pair is governed by an attribute relation if it does not match the typical profile of an is-a or a part-of relation. All the aforementioned types of relations are henceforth called taxonomic in this paper. The work is part of our ongoing research effort to build an economic ontological thesaurus for Modern Greek.

We propose a set of syntactic and semantic features for taxonomy learning. One of the main ideas behind this work is that the proposed methodology should be easily applicable to other languages and other domains. For this reason, and unlike several previous approaches, we try to build a concept hierarchy from scratch, instead of trying to extend an already existing ontology. Furthermore, we make use of no external resources.

The term pairs that do present a taxonomic relation are outnumbered by the pairs that do not. This leads to a class distribution imbalance in the dataset, which in turn leads to poor classification performance on the instances of the underrepresented (rare) classes. One-sided sampling [10] of the majority class instances is applied to the initial dataset to deal with the class imbalance problem, and it is used for the first time in the task of learning taxonomic relations.
Thereby, noisy, misleading and redundant examples of the majority class are discarded and the resulting dataset presents a smoother, more balanced class distribution.

19th IEEE International Conference on Tools with Artificial Intelligence 1082-3409/07 $25.00 © 2007 IEEE DOI 10.1109/ICTAI.2007.72 354
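To make the idea of one-sided sampling described in the introduction more concrete, the following is a minimal sketch of one common formulation of the technique: a condensed nearest-neighbour pass that drops redundant majority examples, followed by removal of majority examples that take part in Tomek links (noisy or borderline examples). It is an illustration under stated assumptions, not necessarily the exact procedure used in this work: the Euclidean distance, the function names `nearest` and `one_sided_sampling`, the `seed` parameter and the toy data are all hypothetical.

```python
import math
import random


def nearest(idx, pool, X):
    """Return the index in `pool` (other than `idx`) closest to X[idx]."""
    return min((j for j in pool if j != idx),
               key=lambda j: math.dist(X[idx], X[j]))


def one_sided_sampling(X, y, minority=1, seed=0):
    """Return the sorted indices of the examples kept after undersampling.

    Assumption: numeric feature vectors compared with Euclidean distance.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == minority]
    neg = [i for i, label in enumerate(y) if label != minority]

    # Step 1: start from all minority examples plus one random majority example.
    start = pos + [rng.choice(neg)]
    kept = list(start)

    # Step 2: add every majority example that a 1-NN classifier built on the
    # starting set misclassifies; correctly classified ones are redundant.
    for i in neg:
        if i in start:
            continue
        if y[nearest(i, start, X)] != y[i]:
            kept.append(i)

    # Step 3: drop majority examples in Tomek links, i.e. pairs of opposite
    # classes that are each other's nearest neighbour.
    tomek = set()
    for i in kept:
        if y[i] == minority:
            continue
        j = nearest(i, kept, X)
        if y[j] == minority and nearest(j, kept, X) == i:
            tomek.add(i)

    return sorted(i for i in kept if i not in tomek)


# Toy data (invented): two minority points, one noisy majority point sitting
# next to them, a tight cluster of majority points, and one isolated point.
X = [(0, 0), (1, 0), (0.1, 0), (10, 0), (10, 1), (11, 0), (11, 1), (4, 0)]
y = [1, 1, 0, 0, 0, 0, 0, 0]
kept = one_sided_sampling(X, y)
print(kept)
```

All minority examples are always retained; only the majority class is reduced, which is what makes the sampling "one-sided". Which redundant majority examples survive depends on the randomly chosen starting example, so results can vary with `seed`.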