A Greedy Algorithm for Selecting Models in Ensembles

Andrei L. Turinsky, University of Calgary, Canada, aturinsk@ucalgary.ca
Robert L. Grossman, University of Illinois at Chicago, USA, and Open Data Partners, USA, grossman@uic.edu

Abstract

We are interested in ensembles of models built over k data sets. Common approaches either combine models by vote averaging or build a meta-model on the outputs of the local models. In this paper, we consider the model assignment approach, in which a meta-model selects one of the local statistical models for scoring. We introduce an algorithm called Greedy Data Labeling (GDL) that improves the initial data partition by reallocating some data, so that when each model is built on its local data subset, the resulting hierarchical system has minimal error. We present evidence that model assignment may in certain situations be more natural than traditional ensemble learning and, if enhanced by GDL, often outperforms traditional ensembles.

1. Introduction

In a standard approach to data mining, a learning algorithm F is applied to a single data set D, where D consists of instances of the form (x, y), with x a vector of data attributes and y a class label. The choice of algorithm defines a parametric family of predictive models. Given the algorithm F and the data set D, a predictive model f: x → y is built to predict the class value of unlabeled data instances.

Ensembles of models arise in several different ways. First, they can be built by taking a single data set and using sampling with replacement to produce k separate data sets [1]. Second, ensembles arise naturally in distributed data mining, where the data may be distributed over k geographical sites. While it may be possible to move all data to a central site and build a single model there, doing so is often too costly or otherwise impractical. The alternative is to mine all data in place and thus build k predictive models (base-models) locally.
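The first construction above can be sketched concretely. In this illustrative Python fragment, the names (`learn_1nn`, `bootstrap_ensemble`, `vote_average`) and the 1-nearest-neighbour learner are our own placeholders for the algorithm F, not anything specified in the paper:

```python
import random

def learn_1nn(data):
    """Stand-in for F: return a model f(x) -> y trained on one data set."""
    def f(x):
        nearest = min(data, key=lambda xy: abs(xy[0] - x))
        return nearest[1]
    return f

def bootstrap_ensemble(D, k, seed=0):
    """Draw k bootstrap samples (sampling with replacement) from D
    and fit one base-model on each sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        sample = [rng.choice(D) for _ in range(len(D))]
        models.append(learn_1nn(sample))
    return models

def vote_average(models, x):
    """A voting/averaging ensemble: the combiner is just a mean."""
    return sum(f(x) for f in models) / len(models)

D = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]
models = bootstrap_ensemble(D, k=5)
print(vote_average(models, 2.1))
```

Any learner with the same f: x → y interface could be substituted for the nearest-neighbour stand-in without changing the ensemble construction.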
Third, ensembles of models arise naturally in hierarchical modeling, where the outputs of one or more models are used as the inputs to another model [2]. Hierarchical modeling arises naturally when data comes from different underlying distributions, in which case a collection of specialized models is desirable. Consider an (idealized) application in which a separate model predicts the risk of heart disease for each age group. When a new patient arrives, the meta-model invokes the predictive model that corresponds to the patient's age.

There are several ways to produce a single score from an ensemble of models. The simplest is a voting/averaging ensemble [3], in which case the meta-model is just an averaging function. A more complex one is meta-learning, where the individual base-models make predictions, after which the meta-model reads their scores and makes the overall prediction [4]. A third technique is model assignment, or model selection, where the meta-model delegates the scoring of an unlabeled data instance to one of the base-models, as in the heart disease example above.

Consider k data sets D_1, …, D_k, each represented by a different color. Denote by D their union as a bag (i.e., points may occur with multiplicity). A model assignment system is created as follows. A base-model f_i is built on each data subset D_i using a learning algorithm F. Then a sample of data from each color is used to train a meta-model that can predict colors.

In this paper, we assume that we are given the k data sets D_1, …, D_k, but have the flexibility to move data between them. This assumption is natural for the cases of distributed data mining and hierarchical modeling mentioned above. We introduce a novel algorithm, called Greedy Data Labeling (GDL), for learning base-models on D_1, …, D_k and a meta-learner for model selection. The algorithm relies on moving small amounts of data between the various data sets D_i.
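The model assignment construction can be sketched as follows. This is a hedged toy illustration under our own assumptions: the helper names (`learn_mean`, `train_assignment_system`), the mean predictor standing in for F, and the 1-nearest-neighbour colour classifier standing in for the meta-model are all hypothetical, not from the paper:

```python
def learn_mean(data):
    """Stand-in for F: fit a base-model f_i on one subset D_i
    (here simply the mean of the observed y values)."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def train_assignment_system(subsets):
    """Build one base-model per colour and a colour-labelled sample
    for the meta-model."""
    base_models = [learn_mean(D_i) for D_i in subsets]
    labelled = [(x, color) for color, D_i in enumerate(subsets)
                for x, _ in D_i]
    def meta_model(x):
        # Delegation: predict the colour of x (1-NN over the labelled
        # sample), then invoke the corresponding base-model.
        _, color = min(labelled, key=lambda xc: abs(xc[0] - x))
        return base_models[color](x)
    return meta_model

# Two "age groups" drawn from different underlying distributions.
D1 = [(20, 0.1), (30, 0.2)]   # younger patients, lower risk
D2 = [(60, 0.6), (70, 0.8)]   # older patients, higher risk
score = train_assignment_system([D1, D2])
print(score(25), score(65))
```

The key design point is that, unlike voting, only one base-model is invoked per unlabeled instance; the meta-model's job is routing, not combining.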
We give preliminary experimental studies showing that in interesting cases involving heterogeneous data, GDL outperforms traditional ensemble learning based upon voting. Since one of our interests is in exploring distributed data partitions, in this paper we consider the special case in which each data instance belongs to only one color at a time. Naturally, the GDL algorithm depends on the initial distribution of data into the subsets {D_i}; this reflects the underlying structure of both distributed data mining and hierarchical modeling of heterogeneous data.
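One plausible greedy reallocation loop in the spirit of GDL can be sketched as below. This is our own illustrative, k-means-style relabelling under simplifying assumptions (mean-predictor base-models, absolute error), not the authors' exact algorithm; the names `learn_mean` and `greedy_relabel` are hypothetical:

```python
def learn_mean(data):
    """Stand-in for F: a base-model returning the mean y of its subset."""
    ys = [y for _, y in data]
    mean_y = sum(ys) / len(ys) if ys else 0.0
    return lambda x: mean_y

def greedy_relabel(subsets, rounds=5):
    """Repeatedly move each point to the subset whose current base-model
    scores it with least error, then retrain, until the partition is stable."""
    for _ in range(rounds):
        models = [learn_mean(D_i) for D_i in subsets]
        points = [xy for D_i in subsets for xy in D_i]
        new_subsets = [[] for _ in subsets]
        for x, y in points:
            # Greedy step: relabel (x, y) with the colour whose model fits it best.
            best = min(range(len(models)),
                       key=lambda i: abs(models[i](x) - y))
            new_subsets[best].append((x, y))
        if new_subsets == subsets:
            break
        subsets = new_subsets
    return subsets

# A mis-partitioned data set: one high-risk point starts in D1.
D1 = [(20, 0.1), (30, 0.2), (65, 0.7)]
D2 = [(60, 0.6), (70, 0.8)]
out = greedy_relabel([D1, D2])
```

On this toy input the misplaced point migrates to the subset whose model fits it, so each base-model is trained on more homogeneous data before the meta-model is built.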