Reducing Class Imbalance during Active Learning for Named Entity Annotation

Katrin Tomanek, Udo Hahn
Jena University Language & Information Engineering (JULIE) Lab
Friedrich-Schiller-Universität Jena, Germany
{katrin.tomanek|udo.hahn}@uni-jena.de

ABSTRACT

In many natural language processing tasks, the classes to be dealt with occur heavily imbalanced in the underlying data set, and classifiers trained on such skewed data tend to exhibit poor performance for low-frequency classes. We introduce and compare different approaches to reduce class imbalance by design within the context of active learning (AL). Our goal is to compile more balanced data sets up front during annotation time when AL is used as a strategy to acquire training material. We situate our approach in the context of named entity recognition. Our experiments reveal that we can indeed reduce class imbalance and increase the performance of classifiers on minority classes while preserving a good overall performance in terms of macro F-score.

Categories and Subject Descriptors

I.2.6 [Computing Methodologies]: Artificial Intelligence—Learning; I.2.7 [Computing Methodologies]: Artificial Intelligence—Natural Language Processing

General Terms

Algorithms, Design, Experimentation, Performance

1. INTRODUCTION

The use of supervised machine learning has become a standard technique for many tasks in natural language processing (NLP). One of the main problems with this technique is its greediness for a priori supplied annotation metadata. The active learning (AL) paradigm [4] offers a promising solution to deal with this demand efficiently. Unlike random selection of training instances, AL biases the selection of examples which have to be manually annotated such that the human labeling effort is minimized.
This is achieved by selecting examples with (presumably) high utility for classifier training. AL has already been shown to meet these expectations in a variety of NLP tasks [6, 10, 15, 19].

K-CAP'09, September 1–4, 2009, Redondo Beach, California, USA. Copyright 2009 ACM 978-1-60558-658-8/09/09.

Machine learning (ML) approaches, however, often face a problem with skewed training data, in particular when it is drawn from already imbalanced ground data. This primary bias can be observed for many NLP tasks such as named entity recognition (NER). Here, imbalance between the different entity classes occurs especially when semantically general classes (such as person names) are split into more fine-grained and specific ones (actors, politicians, sportsmen, etc.). Since rare information carries the potential to be particularly useful and interesting, performance might then be tuned – to a certain extent – in favor of minority classes at the danger of penalizing the overall outcome.

Class imbalance and the resulting effects in learning classifiers from skewed data have been intensively studied in recent years. Common ways to cope with skewed data include different re-sampling strategies and cost-sensitive learning [11, 3, 5]. It has been argued that AL can also be used to alleviate class imbalance: the class imbalance ratio of data points close to the decision boundaries is typically lower than the imbalance ratio in the complete data set [7], so that AL provides the learner with more balanced classes.
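The re-sampling strategies mentioned above can be illustrated by the simplest variant, random oversampling: minority-class examples are duplicated until every class is as frequent as the majority class. The following sketch is not from the paper itself but a minimal, self-contained illustration; the function name and the toy labels are hypothetical.

```python
import random
from collections import Counter

def oversample(examples, labels, seed=0):
    """Random oversampling: duplicate minority-class examples
    (sampled with replacement) until each class reaches the
    frequency of the majority class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    out_x, out_y = list(examples), list(labels)
    for y, xs in by_class.items():
        for _ in range(target - counts[y]):
            out_x.append(rng.choice(xs))  # replicate a minority example
            out_y.append(y)
    return out_x, out_y
```

Note that oversampling only reweights what has already been annotated; the approaches compared in this paper instead aim to balance the classes during annotation itself.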
The focus of this paper is whether this natural characteristic of AL can be intensified to obtain even more balanced data sets. We compare four approaches to reduce the class imbalance up front during AL-driven data acquisition for NER. Section 2 contains a brief sketch of our approach to AL for NER, and in Section 3 we present four alternative ways to reduce class imbalance during AL. Related work is discussed in Section 4. We experimentally evaluate the methods under scrutiny on two data sets from the biomedical domain (Section 5) and discuss the results in Section 6.

2. AL FOR NER

AL is a selective sampling technique where the learning protocol is in control of the data to be used. The goal of AL is to learn a good classifier with minimal human labeling effort. The class labels for examples which are considered most useful for the classifier training are queried iteratively from an oracle – typically a human annotator. In our sce-
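The iterative query protocol described above can be sketched as a generic pool-based AL loop. This is a schematic illustration under stated assumptions, not the paper's actual method: `utility`, `oracle`, and `train` are hypothetical callables standing in for the utility function, the human annotator, and classifier training, respectively.

```python
def active_learning_loop(pool, oracle, train, utility, rounds=10, batch=5):
    """Generic pool-based AL loop: repeatedly select the unlabeled
    examples with the highest utility for the current model, query
    the oracle for their labels, and retrain the classifier."""
    labeled = []          # accumulated (example, label) pairs
    pool = list(pool)     # unlabeled pool
    model = None          # no classifier before the first round
    for _ in range(rounds):
        if not pool:
            break
        # rank the pool by estimated usefulness for classifier training
        pool.sort(key=lambda x: utility(model, x), reverse=True)
        query, pool = pool[:batch], pool[batch:]
        labeled += [(x, oracle(x)) for x in query]  # human annotation step
        model = train(labeled)                      # retrain on all labels
    return model, labeled
```

In practice the utility function is typically derived from the model's uncertainty on an example; the selective bias of this loop is exactly what the approaches in Section 3 modify to favor minority entity classes.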