One-sided Sampling for Learning Taxonomic Relations in the Modern Greek
Economic Domain
Katia Kermanidis and Nikos Fakotakis
Artificial Intelligence Group, Department of Electrical and Computer Engineering
University of Patras, 26500 Rio, Greece
{kerman, fakotaki}@wcl.ee.upatras.gr
Abstract
This paper describes the process of learning
taxonomic relations automatically from Modern Greek
economic corpora. Supervised learning (Decision
trees, Support Vector Machines, Meta-learning) is
applied to economic term pairs; each pair is
represented through a set of statistical, semantic and
syntactic features. The resulting set of feature-value
vectors presents a high imbalance in the class
distribution, due to the large number of term pairs that
do not present a direct semantic relation. This problem
is addressed using One-sided Sampling, which reduces
the number of the majority class instances by removing
examples that are noisy, misleading or redundant. The
approach makes use of no external resources (merely
an economic corpus that is annotated with elementary
morphological and phrase chunking information) and
limited language-dependent elements to facilitate its
portability to other languages and domains. An overall
f-measure of 71% is achieved.
1. Introduction
A domain ontology is a tool that supports
information retrieval, data mining and intelligent search.
It consists of concepts that are important for
communicating domain knowledge. These concepts are
structured hierarchically through taxonomic relations.
A taxonomy usually includes hyperonymy-hyponymy
(is-a) and meronymy (part-of) relations. Learning
taxonomic relations between the concepts that describe
a specific domain automatically from corpora is a key
step towards ontology engineering. The advent of the
semantic web has pushed the construction of concept
taxonomies to the top of the list of interests of
language processing experts.
A complete ontology, however, may also include
further information regarding each concept. The
economic domain, especially, is governed by more
‘abstract’ relations that capture concept attributes (e.g.
rise and drop are two attributes of the concept value, and
stockholder is an attribute of the concept company).
Henceforth, this type of relation will be referred to as
attribute relation.
The present work proposes a methodology for
automatically detecting taxonomic relations between
the terms that have been extracted from Modern Greek
collections of economic texts. Unlike most previous
work, which focuses mainly on hyponymy, this work
also detects meronymy and attribute relations.
A term pair is governed by an attribute
relation if it does not match the typical profile of an
is-a or a part-of relation. All the aforementioned types of
relations are henceforth called taxonomic in this paper.
The work is part of our ongoing research effort to build
an economic ontological thesaurus for Modern Greek.
We propose a set of syntactic and semantic features
for taxonomy learning. A main design goal of this work
is that the proposed methodology can be applied easily
to other languages and other domains.
For this reason, and unlike several previous
approaches, we try to build a concept hierarchy from
scratch, instead of trying to extend an already existing
ontology. Furthermore, we make use of no external
resources.
The term pairs that do present a taxonomic relation
are outnumbered by the pairs that do not. This leads to a
class distribution imbalance in the dataset, which in
turn leads to poor classification performance on the
instances of the underrepresented (rare) classes.
One-sided sampling [10] of the majority class instances is
applied to the initial dataset to deal with the class
imbalance problem; it is used here for the first time in
the task of learning taxonomic relations. Thereby,
noisy, misleading and redundant examples of the
majority class are discarded and the resulting dataset
presents a smoother, more balanced class distribution.
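The effect of one-sided sampling can be illustrated with a minimal sketch. It follows the usual two-step description of the technique: a 1-nearest-neighbour pass that drops redundant majority examples, followed by removal of majority examples that take part in Tomek links (borderline or noisy cases). The sketch is not the implementation used in this work. It assumes a binary labelling (one minority label against everything else), numeric feature vectors and Euclidean distance, and the names one_sided_sampling, nearest and euclidean are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def nearest(i, candidates, X):
    """Index among `candidates` (other than i) whose vector is closest to X[i]."""
    return min((j for j in candidates if j != i),
               key=lambda j: euclidean(X[i], X[j]))


def one_sided_sampling(X, y, minority=1, seed=0):
    """Return the sorted indices of the examples kept after sampling.

    Assumption: every label different from `minority` is treated as
    the majority class. All minority examples are always kept.
    """
    rng = random.Random(seed)
    pos = [i for i in range(len(X)) if y[i] == minority]
    neg = [i for i in range(len(X)) if y[i] != minority]

    # Step 1: start from all minority examples plus one random majority example.
    start = pos + [rng.choice(neg)]

    # Step 2: classify the remaining majority examples with 1-NN against the
    # starting set; keep only the misclassified ones. Correctly classified
    # majority examples are treated as redundant and dropped.
    subset = list(start)
    for i in neg:
        if i in start:
            continue
        if y[nearest(i, start, X)] != y[i]:
            subset.append(i)

    # Step 3: drop majority examples that form a Tomek link, i.e. a majority
    # and a minority example that are each other's nearest neighbour.
    to_remove = set()
    for i in subset:
        if y[i] == minority:
            continue
        j = nearest(i, subset, X)
        if y[j] == minority and nearest(j, subset, X) == i:
            to_remove.add(i)

    return sorted(i for i in subset if i not in to_remove)


# Toy data: two minority points, one borderline majority point (index 2)
# and a tight, distant cluster of majority points (indices 3 to 6).
X = [(0.0, 0.0), (0.0, 1.0), (0.4, 0.0),
     (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
y = [1, 1, 0, 0, 0, 0, 0]

kept = one_sided_sampling(X, y)
print(kept)
```

In this toy example both minority points are retained, while the borderline majority point at index 2 is removed because it forms a Tomek link with the minority point at index 0. How many of the distant majority points survive depends on which majority example is drawn at random in the first step.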
19th IEEE International Conference on Tools with Artificial Intelligence
1082-3409/07 $25.00 © 2007 IEEE
DOI 10.1109/ICTAI.2007.72
354