MetaBags: Bagged Meta-Decision Trees for Regression

Jihed Khiari, NEC Laboratories Europe, jihed.khiari@neclab.eu
Luis Moreira-Matias, NEC Laboratories Europe, luis.moreira.matias@gmail.com
Ammar Shaker, NEC Laboratories Europe, ammar.shaker@neclab.eu
Bernard Ženko, Jožef Stefan Institute, bernard.zenko@ijs.si
Sašo Džeroski, Jožef Stefan Institute, saso.dzeroski@ijs.si

ABSTRACT

Ensembles are popular methods for solving practical supervised learning problems. They reduce the risk of having underperforming models in production-grade software. Although critical, methods for learning heterogeneous regression ensembles have not been proposed at large scale; in the classical ML literature, stacking, cascading and voting are mostly restricted to classification problems. Regression poses distinct learning challenges that may result in poor performance, even when using well-established homogeneous ensemble schemes such as bagging or boosting. In this paper, we introduce MetaBags, a novel, practically useful stacking framework for regression. MetaBags is a meta-learning algorithm that learns a set of meta-decision trees designed to select one base model (i.e., expert) for each query, and focuses on inductive bias reduction. The meta-decision trees are learned using different types of meta-features, specially created for this purpose. Each meta-decision tree is learned on a different bootstrap sample of the data and, given a new example, selects a suitable base model that computes a prediction. Finally, these predictions are aggregated into a single prediction. This procedure is designed to learn a model with a fair bias-variance trade-off, and its improvement over base-model performance is correlated with the prediction diversity of the different experts on specific subregions of the input space.
The proposed method and meta-features are designed in such a way that they enable good predictive performance even in subregions of the input space which are not adequately represented in the available training data. An exhaustive empirical evaluation of the method was performed, assessing both the generalization error and the scalability of the approach on synthetic, open and real-world application datasets. The obtained results show that our method significantly outperforms existing state-of-the-art approaches.

KEYWORDS

Stacking, Regression, Meta-Learning, Landmarking

ACM Reference Format:
Jihed Khiari, Luis Moreira-Matias, Ammar Shaker, Bernard Ženko, and Sašo Džeroski. 2018. MetaBags: Bagged Meta-Decision Trees for Regression. In Proceedings of (Submitted to ECML/PKDD 2018). ACM, New York, NY, USA, Article 4, 11 pages. https://doi.org/10.475/123_4

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Submitted to ECML/PKDD 2018. © 2018 Copyright held by the owner/author(s). ACM ISBN 123-4567-24-567/08/06...$15.00. https://doi.org/10.475/123_4

1 INTRODUCTION

An ensemble is a collection of several models (i.e., experts) that are combined to address a given task (e.g., to obtain a lower generalization error in supervised learning problems) [24]. Ensemble learning can be divided into three different stages [24]: (i) base model generation, where $z$ multiple possible hypotheses $\hat{f}_i(x)$, $i \in \{1, \dots, z\}$, to model a given phenomenon $f(x) = p(y|x)$ are generated; (ii) model pruning, where $c \leq z$ of those are kept and the others discarded; and (iii) model integration, where these hypotheses are combined to form the final one, i.e.,
$\hat{F}(\hat{f}_1(x), \dots, \hat{f}_c(x))$. Naturally, the whole process may require a large pool of computational resources for (i) and/or large and representative training sets to avoid overfitting, since $\hat{F}$ is also estimated/learned on the (partial or full) training set, which has already been used to train the base models $\hat{f}_i(x)$ in (i).

Since the pioneering Netflix competition in 2007 [1] and the coincident introduction of cloud-based solutions for data storage and/or large-scale computing, ensembles have been used increasingly often in industrial applications. A good illustration of this trend is Kaggle, the popular competition website, where, during the last five years, more than 50% of the winning solutions involved at least one ensemble of multiple models [21]. Ensemble learning builds on the principle of committees: there is typically no single expert that outperforms all the others on each and every query; instead, we may obtain better overall performance by combining the answers of multiple experts [28]. Despite the importance of the combining function $\hat{F}$ for the success of the ensemble, most recent research on ensemble learning focuses on (i) model generation and/or (ii) model pruning [24].

Different approaches for model integration can be grouped into three clusters [30]: (a) voting (e.g., bagging [4]), (b) cascading [18], and (c) stacking [32]. In voting, the output of the ensemble is a (weighted) average of the outputs of the base models. Cascading iteratively combines the outputs of the base experts by including them, one at a time, as yet another feature in the training set. Stacking learns a meta-model that combines the outputs of all the base models. All these approaches have advantages and shortcomings. Voting relies on the base models having some complementary expertise¹, an assumption that is rarely true in practice (cf. Fig. 1-(b,c)).
¹ Some base models perform reasonably well in some subregions of the feature space, while other base models perform well in other regions.

arXiv:1804.06207v1 [cs.LG] 17 Apr 2018
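To make the contrast between the two non-iterative integration schemes concrete, the following is a minimal NumPy sketch (not from the paper) that compares voting, i.e. unweighted averaging, with a simple linear stacking combiner $\hat{F}$ on a synthetic regression task; the data-generating function, the two polynomial "experts", and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D task where no single expert dominates everywhere:
# the target is linear on one subregion and quadratic on another.
X = rng.uniform(-2, 2, size=(400, 1))
y = np.where(X[:, 0] < 0, 1.5 * X[:, 0], X[:, 0] ** 2)
y += rng.normal(0, 0.1, 400)
X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]

# (i) Base model generation: two simple experts fit on the training set.
lin_w = np.polyfit(X_tr[:, 0], y_tr, 1)   # linear expert f_1
quad_w = np.polyfit(X_tr[:, 0], y_tr, 2)  # quadratic expert f_2

def predict_base(Xq):
    """Stack the two experts' predictions column-wise."""
    return np.column_stack([np.polyval(lin_w, Xq[:, 0]),
                            np.polyval(quad_w, Xq[:, 0])])

# (iii-a) Voting: unweighted average of the base outputs.
vote_pred = predict_base(X_te).mean(axis=1)

# (iii-b) Stacking: learn a combiner F-hat (here, a linear model with
# intercept over the base outputs) from training-set predictions.
P_tr = predict_base(X_tr)
A = np.column_stack([P_tr, np.ones(len(P_tr))])
coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
P_te = predict_base(X_te)
stack_pred = np.column_stack([P_te, np.ones(len(P_te))]) @ coef

mse = lambda p: float(np.mean((p - y_te) ** 2))
print(f"voting MSE:   {mse(vote_pred):.4f}")
print(f"stacking MSE: {mse(stack_pred):.4f}")
```

Note that, as the paragraph above points out for $\hat{F}$ in general, fitting the combiner on the same training set already used for the base models risks overfitting; in practice the meta-level is usually trained on held-out (e.g., cross-validated) base-model predictions.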