Ensemble-based wrapper methods for feature selection and class imbalance learning

Pengyi Yang 1,3, Wei Liu 2, Bing B. Zhou 1, Sanjay Chawla 1, and Albert Y. Zomaya 1

1 School of Information Technologies, University of Sydney, NSW 2006, Australia
2 Dept of Computing and Information Systems, University of Melbourne, Australia
3 Garvan Institute of Medical Research, Darlinghurst, NSW 2010, Australia

yangpy@it.usyd.edu.au; wei.liu@unimelb.edu.au

Abstract. The wrapper feature selection approach is useful in identifying informative feature subsets from high-dimensional datasets. Typically, an inductive algorithm "wrapped" in a search algorithm is used to evaluate the merit of the selected features. However, significant bias may be introduced when dealing with highly imbalanced datasets. That is, the selected features may favour one class while being less useful to the other class. In this paper, we propose an ensemble-based wrapper approach for feature selection from data with highly imbalanced class distributions. The key idea is to create multiple balanced datasets from the original imbalanced dataset via sampling, and subsequently evaluate feature subsets using an ensemble of base classifiers, each trained on a balanced dataset. The proposed approach provides a unified framework that incorporates ensemble feature selection and multiple sampling in a mutually beneficial way. The experimental results indicate that, overall, features selected by the ensemble-based wrapper are significantly better than those selected by wrappers with a single inductive algorithm in imbalanced data classification.

1 Introduction

Feature selection is a critical procedure for high-dimensional data classification. The benefits of feature selection are several-fold and depend on the application. For creating classification models, feature selection can often improve predictive accuracy and comprehensibility [1].
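The ensemble evaluation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes majority-class undersampling to build the balanced datasets, a nearest-centroid base learner for self-containment (the paper wraps standard inductive algorithms), and balanced accuracy as the merit score; all function names here are hypothetical.

```python
# Hypothetical sketch: score a candidate feature subset with an ensemble of
# base classifiers, each trained on a balanced undersample of the data.
import numpy as np

def fit_centroids(Xs, ys):
    # class centroids of one (balanced) training sample
    return Xs[ys == 0].mean(axis=0), Xs[ys == 1].mean(axis=0)

def predict_centroids(c0, c1, Xs):
    # assign each point to the nearer class centroid
    d0 = np.linalg.norm(Xs - c0, axis=1)
    d1 = np.linalg.norm(Xs - c1, axis=1)
    return (d1 < d0).astype(int)

def ensemble_merit(X, y, feature_subset, n_estimators=5, seed=0):
    """Merit of `feature_subset`: balanced accuracy of a majority-vote
    ensemble whose members are trained on balanced undersamples."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)   # assume class 1 is the rare class
    majority = np.flatnonzero(y == 0)
    votes = np.zeros(len(y))
    for _ in range(n_estimators):
        # pair all minority examples with an equal-sized majority draw
        drawn = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, drawn])   # one balanced dataset
        c0, c1 = fit_centroids(X[np.ix_(idx, feature_subset)], y[idx])
        votes += predict_centroids(c0, c1, X[:, feature_subset])
    pred = (2 * votes >= n_estimators).astype(int)  # majority vote
    recall_min = float(np.mean(pred[minority] == 1))
    recall_maj = float(np.mean(pred[majority] == 0))
    return (recall_min + recall_maj) / 2            # balanced accuracy

# Toy 10:1 imbalanced data where only feature 0 carries signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(220, 3))
y = np.array([0] * 200 + [1] * 20)
X[y == 1, 0] += 4.0
good = ensemble_merit(X, y, [0])   # informative subset
poor = ensemble_merit(X, y, [1])   # pure-noise subset
```

A search algorithm would call a scoring function like `ensemble_merit` on each candidate subset; because every base classifier sees a balanced dataset, the score does not favour the majority class the way a single classifier trained on the raw imbalanced data would.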
For many bioinformatics applications, feature selection is a critical procedure for identifying important biomarkers [2]. Techniques for feature selection are commonly classified into the filter approach, the wrapper approach, and the embedded approach. The filter and embedded approaches are relatively computationally efficient and are commonly applied as fast feature ranking procedures [3]. In contrast, the wrapper approach evaluates features by performing internal classification with a given inductive algorithm [4], and is therefore much more computationally intensive. Nevertheless, the wrapper approach remains attractive for two reasons. Firstly, it evaluates features iteratively with respect to an inductive algorithm. Therefore, features