Multi-Resolution Disambiguation of Term Occurrences

Einat Amitay*, Rani Nelken*, Wayne Niblack**, Ron Sivan*, Aya Soffer*
*IBM Haifa Research Lab, Mount Carmel, Haifa 31905, Israel
**IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120
{einat,rani,ayas,rsivan}@il.ibm.com; niblack@us.ibm.com

ABSTRACT

We describe a system for extracting mentions of terms, such as company and product names, from a large and noisy corpus of documents such as the World Wide Web. Since natural language terms are highly ambiguous, a significant challenge in this task is determining which occurrences of each term carry the intended meaning and which do not. We describe our approach to disambiguation and show that it achieves very high accuracy with only limited training. This serves as a necessary first step for applications that strive to do analytics on term mentions.

Categories and Subject Descriptors
H.3.3 Information Systems, Information Search and Retrieval

General Terms
Disambiguation

Keywords
Information Retrieval, Text Mining, Natural Language Processing

1. INTRODUCTION

In recent years, the Web's importance as a primary knowledge source has continually increased. Due to its wide availability and distributed structure, it allows a large population of users to express opinions on an unbounded range of topics and issues. Consequently, the Web today contains a treasure trove of information about subjects of wide interest, such as people, companies, organizations, and products.

A first step toward any Web-based text mining effort is to collect a significant number of Web mentions of a subject. However, due to the notorious ambiguity of natural language, many subject names have several meanings. This is particularly true of brand names, which are often derived from the names of real-world objects. Thus, the challenge is not only to find all occurrences of the subject, but also to keep only those that have the desired meaning.
The easiest method of finding the set of mentions of a subject is to use a search engine. This is feasible for relatively rare subjects, such as researchers searching for pages that contain their name or references to their work, but it quickly becomes impractical for commonly occurring subjects such as brand names.

For example, consider the term Santana. “Santana” is of potentially significant commercial value to Sony Music, which owns Columbia Records, Epic Records, and Legacy Recordings (each of which has published at least one Santana record). To track what people on the Web are saying about their music, and about Santana in particular, it is necessary to first collect a large number of Web pages that refer to Santana. This would presumably be a first step in a larger application that would also apply sophisticated text-mining analyses to these references.

However, even this first step is problematic due to the ambiguity of natural language and its use on the Web. Many Web pages refer to Santana the flower, the high school, the keelboat, Santana Cycles, the motor agency of Suzuki in Spain, the NFL player, and many other Santanas, rather than to Santana the band or to its guitarist Carlos Santana. A Google search for Santana yields over 1.3 million hits. Clearly, the average user, who typically examines just the first one or two pages of search results, cannot review more than a small fraction of them. Even a highly motivated product or brand manager cannot be expected to filter these results without automation.

Figure 1. Examples of several meanings of the term “Santana” found on the Web (panels: Santana High School, Santana Row, Santana Bicycles, Santana 20, Carlos Santana, Santana Flower).

In fact, the problem is even harder, since we are actually interested in filtering subjects at the resolution of individual hits.
For instance, a copy of Carlos Santana’s Greatest Hits album was recently offered on eBay next to a jersey of the NFL player Santana Moss. The two items were listed just two sentences apart.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’03, November 3–8, 2003, New Orleans, Louisiana, USA. Copyright 2003 ACM 1-58113-723-0/03/0011…$5.00.

In this paper we present a fully functional system that separates the on-topic occurrences of a term from the potential multitude of off-topic references to unrelated entities. For the example above, the system would ideally be able to find