Selective Sampling for Information Extraction with a Committee of Classifiers

Ben Hachey, Markus Becker, Claire Grover & Ewan Klein
University of Edinburgh
{bhachey,s0235256,grover,ewan}@inf.ed.ac.uk

Introduction

We describe the Edinburgh-Stanford approach to task 2 of the Pascal Challenge Evaluating Machine Learning for Information Extraction (http://nlp.shef.ac.uk/pascal/), the active learning task. Active learning promises to reduce the cost of supervised training by requesting the most informative data points for human annotation. The literature contains a number of approaches to selecting informative data points. Example informativity can be estimated by the degree of uncertainty of a single learner as to the correct label of a data point (Cohn et al., 1995), or in terms of the disagreement of a committee of learners (Seung et al., 1992). Active learning has been successfully applied to a variety of tasks such as document classification (McCallum and Nigam, 1998), part-of-speech tagging (Argamon-Engelson and Dagan, 1999), and parsing (Thompson et al., 1999).

We employ a committee-based method in which the degree to which different classifiers disagree in their analyses indicates whether an example is potentially useful. Committee members can be diversified by bagging (Abe and Mamitsuka, 1998), by randomly perturbing event counts (Argamon-Engelson and Dagan, 1999), or by employing different feature sets for the same classifier (Baldridge and Osborne, 2004). In this paper, we present active learning experiments for information extraction following the last approach. Our approach gives the highest average improvement over random sampling for micro-averaged and macro-averaged f-scores.

System Description

We use a conditional Markov model tagger (Klein et al., 2003; Finkel et al., 2005) to train two different models on the same data by splitting the feature set. The tagger incorporates a maximum entropy classifier and models tag sequences following McCallum et al. (2000).

There are a large number of metrics that could be used to quantify the degree of deviation between classifiers in a committee (e.g. KL-divergence, information radius, f-complement). The work reported here uses two sentence-level metrics based on KL-divergence and one based on f-measure.
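To make the two kinds of disagreement metric concrete, the following is a minimal sketch, not the authors' implementation: the function names, the symmetrised and token-averaged form of the KL score, and the representation of tagger output as sets of entity spans are all our assumptions here. It computes a sentence-level KL-divergence score from two committee members' per-token label distributions, an f-complement score (one minus the F1 of one member's spans scored against the other's), and ranks an unlabelled pool by disagreement.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the label set.
    A small epsilon guards against zero probabilities."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def sentence_kl(dists_a, dists_b):
    """One sentence-level KL metric (an assumed form): the symmetrised
    KL-divergence between the two members' label distributions,
    averaged over the tokens of the sentence."""
    scores = [0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
              for p, q in zip(dists_a, dists_b)]
    return sum(scores) / len(scores)

def f_complement(spans_a, spans_b):
    """f-measure-based disagreement: 1 - F1 of member B's entity spans
    scored as if member A's output were the reference annotation."""
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 0.0                      # both silent: perfect agreement
    tp = len(a & b)
    prec = tp / len(b) if b else 0.0
    rec = tp / len(a) if a else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return 1.0 - f1

def select_batch(pool, batch_size, metric=sentence_kl):
    """Selective sampling step: rank unlabelled sentences by committee
    disagreement and return the most informative ones for annotation."""
    ranked = sorted(pool, key=lambda s: metric(s["a"], s["b"]), reverse=True)
    return ranked[:batch_size]
```

Under this sketch, sentences on which the two feature-split models agree closely score near zero on both metrics and are left in the pool, while high-disagreement sentences are sent to the annotator.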