Association Rules Mining for Name Entity Recognition

Indra Budi
Computer Science Faculty, University of Indonesia, indra@cs.ui.ac.id

Stéphane Bressan
School of Computing, National University of Singapore, steph@nus.edu.sg

Abstract

We propose a new name entity class extraction method based on association rules. We evaluate and compare the performance of our method with the state-of-the-art maximum entropy method. We show that our method consistently yields a higher precision at a competitive level of recall. This result makes our method particularly suitable for tasks whose requirements emphasize the quality rather than the quantity of results.

1. Introduction

The work presented in this paper¹ is part of a larger effort to develop information retrieval and linguistic systems, tools, and techniques for an Indonesian Digital Library (see [12,16]). The application that motivates this research requires that semantic structures in XML or RDF be extracted from and superimposed on a corpus of documents written in the Indonesian language. Another motivation is that the Semantic Web (an abstract representation of data on the World Wide Web) can be achieved by extracting semantic structure from such documents. The natural first step of this project consists of identifying name entities in the texts of the corpus.

Name Entity Recognition (NER) is an information extraction task concerned with the recognition and classification of name entities in free text [11]. Name entity classes are, for instance, locations, person names, organization names, dates, and money amounts. Terms and expressions in the text correspond to the entities they represent. For example, in the sentence “British Foreign Office Minister O'Brien (right) and President Megawati pose for photographers at the Palace”, a name entity recognition process looking for named persons and locations would identify the two persons O'Brien and Megawati and the location Palace.
This recognition can be based on a variety of features of the terms, the sentence, the text and its syntax, and could leverage external sources of information such as thesauri and dictionaries. In the example, a system may have applied a simple rule guessing that the capitalized words directly following the terms ‘President’ or ‘Minister’ are names of persons.

In this paper, we propose a method for name entity class recognition based on such rules in the form of association rules. The association rules defining the patterns of syntax and term features potentially defining the classes are mined from a training set of documents.

The rest of the paper is organized as follows. We present and discuss some background and related work on name entity class recognition in the next section. The method is presented in section 3, in which we detail the learning or mining phase and the testing and tagging phase. We evaluate and compare our method with the state-of-the-art maximum entropy method [7] in section 4. Finally, we conclude and identify the next steps of our research.

¹ An extended version of this paper is available as [8].

2. Background and Related Work

The first family of approaches to name entity recognition relies on the hand-crafting of models and techniques for the recognition of entity classes. Generally, the models consist of a set of patterns using grammatical (e.g. part of speech), syntactic (e.g. word precedence) and orthographic features (e.g. capitalization) in combination with dictionaries and thesauri. An example pattern in this type of system is the one we have suggested in our introductory example: “If a proper noun follows a person’s title, then the proper noun is a person’s name”. In this family of approaches, Appelt et al. propose a name identification system based on carefully hand-crafted regular expressions [2,3], while Iwanska [13] uses extensive specialized resources such as gazetteers, and white and yellow pages.
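As an illustration only, the hand-crafted pattern quoted above (“If a proper noun follows a person’s title, then the proper noun is a person’s name”) could be sketched as follows. This is a minimal sketch of a manually coded rule, not the association rule mining method proposed in this paper. The title list `PERSON_TITLES`, the function name `tag_persons`, and the whitespace tokenization are assumptions made for the example, and capitalization is used as a crude stand-in for a proper noun test.

```python
# Hypothetical list of person titles, chosen for this example only.
PERSON_TITLES = {"President", "Minister"}


def tag_persons(sentence):
    """Return runs of capitalized words that directly follow a person title."""
    # Naive whitespace tokenization; surrounding punctuation is stripped.
    tokens = [t.strip(".,;:()\"") for t in sentence.split()]
    persons = []
    i = 0
    while i < len(tokens):
        if tokens[i] in PERSON_TITLES:
            j = i + 1
            name = []
            # Collect the capitalized words that immediately follow the title.
            while (j < len(tokens)
                   and tokens[j][:1].isupper()
                   and tokens[j] not in PERSON_TITLES):
                name.append(tokens[j])
                j += 1
            if name:
                persons.append(" ".join(name))
            i = j
        else:
            i += 1
    return persons


sentence = ("British Foreign Office Minister O'Brien (right) and "
            "President Megawati pose for photographers at the Palace")
print(tag_persons(sentence))
```

On the introductory sentence, this rule picks out the words following ‘Minister’ and ‘President’. A rule of this kind has to be written and maintained by hand for each title and each language.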
Morgan, for the same purpose, uses a highly sophisticated linguistic analysis [14]. These approaches rely on manually coded rules and manually compiled corpora. They often incur prohibitive development and maintenance costs. Furthermore, for reasons of cost and effectiveness, they are often domain and language specific and do not necessarily adapt well to new domains and languages.

The alternative to hand-crafted approaches is the use of data mining, knowledge discovery and machine learning techniques. Both Sekine [15] and Bennett [4] propose name identification systems based on decision trees. The decision trees in both approaches use such

Proceedings of the Fourth International Conference on Web Information Systems Engineering (WISE’03) 0-7695-1999-7/03 $17.00 © 2003 IEEE