arXiv:cmp-lg/9708010 18 Aug 1997. Appears in the proceedings of ACL-EACL '97.

Similarity-Based Methods for Word Sense Disambiguation

Ido Dagan
Dept. of Mathematics and Computer Science
Bar Ilan University
Ramat Gan 52900, Israel
dagan@macs.biu.ac.il

Lillian Lee
Div. of Engineering and Applied Sciences
Harvard University
Cambridge, MA 02138, USA
llee@eecs.harvard.edu

Fernando Pereira
AT&T Labs – Research
600 Mountain Ave.
Murray Hill, NJ 07974, USA
pereira@research.att.com

Abstract

We compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency. The similarity-based methods perform up to 40% better on this particular task. We also conclude that events that occur only once in the training set have a major impact on similarity-based estimates.

1 Introduction

The problem of data sparseness affects all statistical methods for natural language processing. Even large training sets tend to misrepresent low-probability events, since rare events may not appear in the training corpus at all.

We concentrate here on the problem of estimating the probability of unseen word pairs, that is, pairs that do not occur in the training set. Katz's back-off scheme (Katz, 1987), widely used in bigram language modeling, estimates the probability of an unseen bigram from unigram estimates. This has the undesirable result of assigning the same probability to all unseen bigrams whose component unigrams have the same frequencies.

Class-based methods (Brown et al., 1992; Pereira, Tishby, and Lee, 1993; Resnik, 1992) cluster words into classes of similar words, so that the estimate of a word pair's probability can be based on the averaged cooccurrence probability of the classes to which the two words belong.
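The weakness of back-off noted above can be made concrete with a small sketch. This is not Katz's exact formulation: the constant `alpha` here stands in for Katz's per-history back-off weight, which in the real scheme is computed from the discounted probability mass; it is an assumption for illustration only.

```python
from collections import Counter

def katz_backoff_sketch(bigram_counts, unigram_counts, total_unigrams, alpha=0.4):
    """Illustrative back-off bigram estimator (simplified from Katz, 1987).

    A seen pair (w1, w2) gets its maximum-likelihood estimate; an unseen
    pair backs off to a scaled unigram estimate of w2.  Consequently, all
    unseen bigrams whose second words have the same frequency receive the
    same probability -- the weakness discussed in the text.
    """
    def prob(w1, w2):
        if (w1, w2) in bigram_counts:
            return bigram_counts[(w1, w2)] / unigram_counts[w1]
        return alpha * unigram_counts[w2] / total_unigrams
    return prob

# Toy counts (hypothetical): "apple" and "banana" are equally frequent,
# so the unseen pairs (drink, apple) and (drink, banana) are
# indistinguishable under back-off.
bigrams = Counter({("eat", "apple"): 3})
unigrams = Counter({"eat": 5, "drink": 4, "apple": 3, "banana": 3})
p = katz_backoff_sketch(bigrams, unigrams, total_unigrams=15)
```

Here `p("drink", "apple")` and `p("drink", "banana")` are equal even though the two pairs might have very different true likelihoods.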
However, a word is then modeled by the average behavior of many words, which may cause that word's idiosyncrasies to be ignored. For instance, the word "red" might well act like a generic color word in most cases, but it has distinctive cooccurrence patterns with respect to words like "apple," "banana," and so on.

We therefore consider similarity-based estimation schemes that do not require building general word classes. Instead, estimates for the words most similar to a word w are combined; the evidence provided by a word w′ is weighted by a function of its similarity to w. Dagan, Marcus, and Markovitch (1993) propose such a scheme for predicting which unseen cooccurrences are more likely than others. However, their scheme does not assign probabilities. In what follows, we focus on probabilistic similarity-based estimation methods.

We compared several such methods, including that of Dagan, Pereira, and Lee (1994) and the cooccurrence smoothing method of Essen and Steinbiss (1992), against classical estimation methods, including that of Katz, in a decision task involving unseen pairs of direct objects and verbs, in which unigram frequency was eliminated as a factor. We found that all the similarity-based schemes performed almost 40% better than back-off, which is expected to yield about 50% accuracy in our experimental setting. Furthermore, a scheme based on the total divergence of empirical distributions to their average [1] yielded a statistically significant improvement in error rate over cooccurrence smoothing.

We also investigated the effect of removing extremely low-frequency events from the training set. We found that, in contrast to back-off smoothing, where such events are often discarded from the training data, events that occur only once in the training set have a major impact on similarity-based estimates.

[1] To the best of our knowledge, this is the first use of this particular distribution dissimilarity function in statistical language processing.
The function itself is implicit in earlier work on distributional clustering (Pereira, Tishby, and Lee, 1993), has been used by Tishby (p.c.) in other distributional similarity work, and, as suggested by Yoav Freund (p.c.), it is related to results of Hoeffding (1965) on the probability that a given sample was drawn from a given joint distribution.
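The two ingredients just introduced, a weighted combination of conditional estimates from similar words and the total divergence of two empirical distributions to their average, can be sketched as follows. The exponential weighting `10 ** (-beta * A)`, the value of `beta`, and the choice of neighbor set are illustrative assumptions here, not the tuned values used in the experiments.

```python
import math

def total_divergence_to_average(p, q):
    """A(p, q) = D(p || m) + D(q || m), where m = (p + q) / 2 and D is
    KL divergence (base 2).

    p and q are dicts mapping events to probabilities.  Terms with
    p[x] = 0 contribute nothing (0 log 0 = 0), and m(x) > 0 whenever
    p(x) > 0 or q(x) > 0, so the sum is always finite: A ranges from 0
    (identical distributions) to 2 bits (disjoint supports).
    """
    total = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        m = 0.5 * (px + qx)
        if px > 0:
            total += px * math.log(px / m, 2)
        if qx > 0:
            total += qx * math.log(qx / m, 2)
    return total

def similarity_weighted_estimate(w1, w2, cond, neighbors, beta=1.0):
    """P_SIM(w2 | w1): combine the estimates P(w2 | w1') over words w1'
    similar to w1, each weighted by 10 ** (-beta * A(w1, w1')) and
    normalized to sum to one.

    `cond[w]` is the empirical conditional distribution of second words
    given w; `neighbors` is the candidate set of similar words (an
    assumption for this sketch -- the paper restricts to the most
    similar words).
    """
    weights = {v: 10 ** (-beta * total_divergence_to_average(cond[w1], cond[v]))
               for v in neighbors if v != w1}
    norm = sum(weights.values())
    return sum(wt * cond[v].get(w2, 0.0) for v, wt in weights.items()) / norm
```

With this scheme, an unseen pair (w1, w2) can still receive a nonzero probability whenever some word distributionally similar to w1 has been observed with w2, and more similar words (smaller A) contribute more.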