Association Rules Mining for Name Entity Recognition
Indra Budi
Computer Science Faculty, University of
Indonesia, indra@cs.ui.ac.id
Stéphane Bressan
School of Computing, National University of
Singapore, steph@nus.edu.sg
Abstract
We propose a new name entity class extraction
method based on association rules. We evaluate and
compare the performance of our method with the
state-of-the-art maximum entropy method. We show that our
method consistently yields a higher precision at a
competitive level of recall. This result makes our method
particularly suitable for tasks whose requirements
emphasize the quality rather than the quantity of results.
1. Introduction
The work presented in this paper¹ is part of a larger
effort to develop information retrieval and linguistic
systems, tools, and techniques for an Indonesian Digital
Library (see [12,16]). The application that motivates this
research requires that semantic structures in XML or RDF
be extracted from and superimposed on a corpus of
documents written in the Indonesian language. Another
motivation is that the Semantic Web (an abstract representation
of data on the World Wide Web) can be realized by
extracting semantic structure from such documents. The
natural first step of this project consists of identifying
name entities in the texts of the corpus.
Name Entity Recognition (NER) is an information
extraction task that is concerned with the recognition and
classification of name entities in free text [11]. Name
entity classes are, for instance, locations, person names,
organization names, dates, and money amounts. Terms
and expressions in the text correspond to the entities they
represent. For example, in the sentence: “British Foreign
Office Minister O'Brien (right) and President Megawati
pose for photographers at the Palace”, a name entity
recognition process looking for named persons and
locations would identify the two persons O'Brien and
Megawati and the location Palace. This recognition can
be based on a variety of features of the terms, the
sentence, the text and its syntax, and could leverage
external sources of information such as thesauri and
dictionaries. In the example, a system may
have applied a simple rule guessing that the capitalized
words directly following the terms ‘President’ or
‘Minister’ are names of persons.
¹ An extended version of this paper is available as [8].
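The simple title rule mentioned above can be sketched in a few lines of Python. This is only an illustration of a hand-written rule, not the method proposed in this paper; the title list, the regular expression, and the function name guess_persons are assumptions made for the example.

```python
import re

# Hypothetical title list; the text only mentions 'President' and 'Minister'.
TITLES = ("President", "Minister")

# A title followed by one capitalized word (apostrophes allowed, as in O'Brien).
PERSON_AFTER_TITLE = re.compile(
    r"\b(?:%s)\s+([A-Z][\w']*)" % "|".join(TITLES)
)

def guess_persons(sentence):
    """Return the capitalized words that directly follow a known title."""
    return PERSON_AFTER_TITLE.findall(sentence)

sentence = ("British Foreign Office Minister O'Brien (right) and "
            "President Megawati pose for photographers at the Palace")
print(guess_persons(sentence))
```

Such a rule finds the two persons in the example sentence but says nothing about the location Palace, which hints at why a larger set of rules is needed in practice.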
In this paper, we propose a method for name entity
class recognition based on such rules in the form of
association rules. The association rules, which capture
patterns of syntactic and term features that potentially
characterize the classes, are mined from a training set of documents.
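To make the notion of an association rule concrete, the following minimal sketch computes the standard support and confidence measures for one candidate rule over a toy training set. The feature names (such as prev=President), the data, and the helper functions are invented for illustration; the features and mining procedure actually used by our method are those described in section 3.

```python
# Each transaction is the set of features observed for one token, plus its
# class label. Feature names and data are invented for illustration only.
tokens = [
    {"prev=President", "capitalized", "class=PERSON"},
    {"prev=Minister", "capitalized", "class=PERSON"},
    {"prev=the", "capitalized", "class=LOCATION"},
    {"prev=President", "capitalized", "class=PERSON"},
    {"prev=for", "lowercase", "class=OTHER"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by the
    support of the antecedent alone (0.0 if the antecedent never occurs)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base

# Candidate rule: {prev=President, capitalized} => {class=PERSON}
rule_if = {"prev=President", "capitalized"}
rule_then = {"class=PERSON"}
print(support(rule_if | rule_then, tokens))
print(confidence(rule_if, rule_then, tokens))
```

A mining algorithm would typically keep only the rules whose support and confidence exceed chosen thresholds; the thresholds themselves are a design choice not fixed by this sketch.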
The rest of the paper is organized as follows. We
present and discuss some background and related work on
name entity class recognition in the next section. The
method is presented in section 3, in which we detail the
learning or mining phase and the testing and tagging
phase. We evaluate and compare our method with the
state-of-the-art maximum entropy method [7] in section 4.
Finally, we conclude and identify the next steps of our
research.
2. Background and Related Work
The first family of approaches to name entity
recognition relies on the hand crafting of models and
techniques for the recognition of entity classes. Generally
the models consist of a set of patterns using grammatical
(e.g. part of speech), syntactic (e.g. word precedence) and
orthographic features (e.g. capitalization) in combination
with dictionaries and thesauri. An example pattern in this
type of system is the one we have suggested in our
introductory example: “If a proper noun follows a
person’s title, then the proper noun is a person’s name”.
In this family of approaches, Appelt et al. propose a
name identification system based on carefully hand-crafted
regular expressions [2,3], while Iwanska [13] uses
extensive specialized resources such as gazetteers, and
white and yellow pages. Morgan, for the same purpose,
uses a highly sophisticated linguistic analysis [14].
These approaches rely on manually coded
rules and manually compiled corpora. They often incur
prohibitive development and maintenance costs.
Furthermore, for cost reduction and effectiveness reasons,
they are often domain and language specific and do not
necessarily adapt well to new domains and languages.
The alternative to hand-crafted approaches is the use
of data mining, knowledge discovery and machine
learning techniques. Both Sekine [15] and Bennett [4]
propose name identification systems based on decision
trees. The decision trees in both approaches use such
Proceedings of the Fourth International Conference on Web Information Systems Engineering (WISE’03)
0-7695-1999-7/03 $17.00 © 2003 IEEE