Learning with few examples: an empirical study on leading classifiers

Christophe Salperwyck and Vincent Lemaire

Abstract— Learning algorithms have proved their ability to deal with large amounts of data. Most statistical approaches use fixed-size training sets and produce static models. However, in specific situations such as active or incremental learning, the learning task starts with only very few data. In that case, algorithms able to produce accurate models from only a few examples become necessary. Classifiers in the literature are generally evaluated with criteria such as accuracy or the ability to order data (ranking)... But this ranking of classifiers can change dramatically if the focus is on the ability to learn with just a few examples. To our knowledge, only a few studies have addressed this problem. The study presented in this paper covers a larger panel of both algorithms (9 different kinds) and data sets (17 UCI bases).

I. INTRODUCTION

Learning machines have shown their ability to deal with large data volumes on real problems [1], [2]. Nevertheless, most of this work was done for data analysis on homogeneous and stationary data. Learning machines usually use data sets of fixed size and produce static models. However, in certain situations the learning task starts with only a few examples. In such cases, algorithms able to produce accurate, low-variance models from few data are an advantage. Active and incremental learning are the two main learning problems in which a learning machine able to learn from few data is necessary. This study only focuses on supervised learning.

Active learning [3] is used when lots of data are available but labeling them is expensive (labels have to be bought). In that case the goal is to select the smallest amount of data which will provide the best model. These data are expected to be very informative, and an algorithm able to deliver an accurate model from just a few examples is needed in order to avoid buying more labels.
Incremental learning [4] starts with few examples, as it theoretically has to learn from the first examples provided. The model is then improved as new examples arrive. The quality of the model at the beginning therefore depends on the algorithm's capacity to learn fast with few examples. Research on incremental learning started a long time ago, but it has recently reappeared with data stream mining. Indeed, numerous systems generate data streams: sensor networks, web page access logs... These data arrive at high speed and can only be read once, so it is mandatory to learn them as soon as they arrive (on-line learning). Incremental learning appears to be a natural solution to stream problems.

(Authors are in the group 'Profiling and Datamining', Orange Labs, 2 avenue Pierre Marzin, 22300 Lannion, France; phone: +33 296 053 107; email: firstname.name@orange-ftgroup.com.)

An example is Hoeffding trees [5], which are widely used in incremental learning on data streams. The tree is built incrementally: leaves are transformed into internal nodes as examples arrive. Having a classifier in the tree leaves [6] before they are transformed appears to improve the tree's accuracy. A classifier that can learn with few data provides a pertinent local model in the leaves.

The most commonly used classifiers, such as decision trees, neural networks, support vector machines... are often evaluated with criteria such as accuracy, ability to rank data... But this ranking of classifiers can be completely different if the focus is on their ability to learn with just a few examples. To our knowledge, the literature contains only a few studies on learning performance versus the size of the training set: in [7] the performance on small and unbalanced text data sets is studied using 3 different classifiers (Support Vector Machine (SVM), naive Bayes and logistic regression).
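To illustrate why Hoeffding trees interact with few-example learning, the sketch below (not part of the original paper) computes the standard Hoeffding bound: with probability 1 - delta, the observed mean of a bounded random variable over n examples is within epsilon of its true mean. A leaf is split only once the gain gap between the two best attributes exceeds epsilon, so a leaf may hold few examples for a long time, which is where a classifier that learns fast from few data pays off. The function name and the printed values of n are illustrative choices, not from the paper.

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the true mean of a
    random variable of range `value_range` lies within epsilon of the
    mean observed over n independent examples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Epsilon shrinks as more examples reach a leaf; until then, decisions
# at that leaf rest on few examples.
for n in (100, 1000, 10000):
    print(n, hoeffding_bound(1.0, 1e-6, n))
```

The bound decreases as the square root of n, which is why early leaves, fed with few examples, benefit most from a local classifier with low variance.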
In [8], the authors focus on learning time, whereas this study focuses on performance versus the size of the training set. In [9] and [10] the focus is respectively on Parzen windows and k nearest neighbors. In [11] the construction of linear classifiers is considered for very small sample sizes using a stability measure. In [12] the influence of the training set size is evaluated on 4 computer-aided diagnosis problems using 3 different classifiers (SVM, C4.5 decision tree, k-nearest neighbor). In [13] the authors look at how increasing the data set size affects the bias and variance error decompositions of classification algorithms. The conclusions of these papers will be compared to the results of this empirical study at the end of this paper.

The present work studies a larger panel of both learning algorithms (9 different kinds) and data sets (17 from UCI). Section II presents the classifiers used in this study and their parameters. The experimental protocol is presented in section III: data sets, split between training and test sets, evaluation criterion. Section IV presents the results and analyzes them depending on the typology of the classifiers. In the last part we conclude and propose future works related to this study.

II. LEARNING SPEED AND CLASSIFIERS TYPOLOGY

A. Learning speed

First, this study does not focus on the bounds or convergence time of a classifier given a training set of n examples. That would correspond to determining the CPU time needed for this classifier to learn n examples. In this study the "learning speed" means the (minimum) number of training examples