A Bayesian Approach to Semi-Supervised Learning

Rebecca Bruce
Department of Computer Science
University of North Carolina at Asheville, Asheville, NC 28804
bruce@cs.unca.edu

Abstract

Recent research in automated learning has focused on algorithms that learn from a combination of tagged and untagged data. Such algorithms are referred to as semi-supervised, in contrast to unsupervised algorithms, which require no tagged data whatsoever. This paper presents a Bayesian approach to semi-supervised learning. In this approach, the parameters of a probability model are estimated using Bayesian techniques and then used to perform classification. The prior probability distribution is formulated from the tagged data via a process akin to stochastic generalization. Intuitively, the generalization process starts with a small amount of tagged data and adds to it new pseudo-counts similar to those that would be expected in a larger data sample from the same population. The prior distribution, together with the untagged data, forms the posterior distribution, which is used to estimate the model parameters via the EM algorithm.

This procedure is demonstrated by applying it to the task of word-sense disambiguation. When priors are formulated from as few as 15 randomly selected tagged instances, the resulting classifier has an accuracy 21% higher than that of a classifier developed using no tagged data. When 700 tagged instances are used to formulate priors, the accuracy of the classifier exceeds that of a classifier developed from 2,124 tagged instances using standard supervised learning techniques.

1 Introduction

Several learning algorithms have recently been proposed that use a combination of tagged and untagged data; this is referred to as semi-supervised learning. The motivation for semi-supervised learning is that tagged data is expensive to produce, while untagged data can usually be acquired cheaply.
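The procedure outlined in the abstract, in which tagged data supplies prior pseudo-counts that EM then combines with expected counts from the untagged data, can be sketched roughly as follows. The two-class, binary-feature Naive Bayes model, the function name, and all numeric values here are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact model): MAP-style EM in which
# pseudo-counts derived from tagged data act as an informative prior and are
# added to the expected counts computed from untagged data at each M-step.
import numpy as np

def em_with_pseudocounts(X_untagged, pseudo_class, pseudo_feat, n_iter=50):
    """X_untagged: (n, d) binary feature matrix of untagged instances.
    pseudo_class: (k,) per-class pseudo-counts from the tagged data.
    pseudo_feat: (k, d) per-class feature pseudo-counts (each < pseudo_class)."""
    # Initialize parameters from the pseudo-counts (i.e., the prior) alone.
    pi = pseudo_class / pseudo_class.sum()              # class priors
    theta = pseudo_feat / pseudo_class[:, None]         # P(feature=1 | class)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each class for each instance.
        log_p = (np.log(pi)
                 + X_untagged @ np.log(theta).T
                 + (1 - X_untagged) @ np.log(1 - theta).T)
        resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: expected counts from untagged data plus prior pseudo-counts.
        class_counts = resp.sum(axis=0) + pseudo_class
        feat_counts = resp.T @ X_untagged + pseudo_feat
        pi = class_counts / class_counts.sum()
        theta = feat_counts / class_counts[:, None]
    return pi, theta
```

Because the pseudo-counts are strictly positive and never dropped from the M-step, the estimates stay away from the degenerate 0/1 boundary, which is one practical benefit of treating the tagged data as a prior rather than as ordinary training counts.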
As a result, training data in NLP usually consists of relatively few (if any) tagged data points in a high-dimensional feature space. The idea behind semi-supervised learning is to exploit the tagged data to acquire information about the problem and then use that information to guide learning from the untagged data (i.e., unsupervised learning) in the high-dimensional feature space.

Most recent work has approached the problem from the point of view of co-training (Blum & Mitchell 1998). The problem is cast in terms of learning a tagging function f(x) when the features describing each instance can be partitioned into two distinct sets, each of which is sufficient to define f(x). In this situation, two distinct classifiers can be defined, one for each set of features. Co-training then consists of iteratively using the output of one classifier to train the other in tagging the untagged data. Examples of algorithms that fall into this general category are (Collins & Singer 1999; Blum & Mitchell 1998; Yarowsky 1995). In Nigam et al. (2000) and Collins and Singer (1999), the EM algorithm is used to estimate the parameters of the Naive Bayes model from both tagged and untagged data. The tagged data is viewed as complete data, while the untagged data is viewed as incomplete because the tags are assumed to be missing at random.

In this work, the EM algorithm is also used to estimate model parameters from both tagged and untagged data. But, in contrast to the work by Nigam et al. and Collins and Singer, the tagged data is used to formulate an informative prior distribution of pseudo-counts, and these pseudo-counts are combined with the counts in the untagged data to formulate the posterior distribu-