Selective Sampling for Classification

François Laviolette, Mario Marchand, and Sara Shanian
IFT-GLO, Université Laval, Québec (QC), Canada, G1V-0A6
{first_name.last_name}@ift.ulaval.ca

Abstract. Supervised learning is concerned with the task of building accurate classifiers from a set of labelled examples. However, the task of gathering a large set of labelled examples can be costly and time-consuming. Active learning algorithms try to reduce this labelling cost by performing a small number of label-queries from a large set of unlabelled examples during the process of building a classifier. However, the level of performance achieved by active learning algorithms is not always up to expectations, and no rigorous performance guarantee, in the form of a risk bound, exists for non-trivial active learning algorithms. In this paper, we propose a novel (and easy to implement) active learning algorithm that has a rigorous performance guarantee (i.e., a valid risk bound) and that performs very well in comparison with some widely-used active learning algorithms.

1 Introduction

Gathering experimental data is essential for any learning task. In classification, we usually gather an amount of training data and then infer a classifier from it. This methodology is called passive learning. However, in order to build an accurate classifier, it is generally necessary to gather a large number of labelled examples for the training set, which is itself an expensive and time-consuming task. One way to overcome this problem is to use another methodology called active learning. Active learning includes any form of learning in which the learning program has some control over the examples it trains on. Instead of randomly selecting the examples to be labelled, the learning algorithm can more carefully choose or query them from a finite set of examples (pool-based model) or from a sequence of examples (stream-based model).
In this way, we expect to substantially reduce the number of examples that must be labelled in order to achieve a given accuracy. A common example is the World-Wide Web, which provides a profusion of training pages for text categorization problems. Thus, the major goal of any active learning algorithm is to obtain a good classifier within a reasonable number of labelling queries. In order to make a query, the active learner goes through the entire pool or stream of unlabelled examples and selects the example to be labelled next (selective sampling). Active learning methods fall under two main categories based on the criterion used to select the next queries: uncertainty sampling and query by committee. For uncertainty sampling [7,9], the learner selects the examples to be labelled

S. Bergler (Ed.): Canadian AI 2008, LNAI 5032, pp. 191–202, 2008.
© Springer-Verlag Berlin Heidelberg 2008
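To make the pool-based selective-sampling loop concrete, the following is a minimal sketch of uncertainty sampling in Python. It is an illustration only, not the algorithm proposed in this paper: the names `select_query` and `predict_proba`, the sigmoid "classifier", and the toy one-dimensional pool are all assumptions introduced here. The query rule shown is the simplest common variant, picking the unlabelled example whose predicted positive-class probability is closest to 0.5.

```python
import math


def select_query(pool, predict_proba):
    """Uncertainty sampling: return the unlabelled example whose
    predicted positive-class probability is closest to 0.5,
    i.e. the point the current classifier is least sure about."""
    return min(pool, key=lambda x: abs(predict_proba(x) - 0.5))


# Toy demo (hypothetical): a 1-D pool and a sigmoid "classifier"
# whose decision boundary sits at x = 0.
def predict_proba(x):
    return 1.0 / (1.0 + math.exp(-x))


pool = [-3.0, -1.0, 0.2, 2.0]

# The point nearest the decision boundary (x = 0.2) is queried first.
query = select_query(pool, predict_proba)
print(query)
```

In a full active-learning loop, the queried example would be labelled by an oracle, moved from the pool to the training set, and the classifier retrained before the next query; query-by-committee replaces the single `predict_proba` with the disagreement among a committee of classifiers.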