Ensemble-based wrapper methods for feature selection and class imbalance learning

Pengyi Yang 1,3, Wei Liu 2, Bing B. Zhou 1, Sanjay Chawla 1, and Albert Y. Zomaya 1

1 School of Information Technologies, University of Sydney, NSW 2006, Australia
2 Dept of Computing and Information Systems, University of Melbourne, Australia
3 Garvan Institute of Medical Research, Darlinghurst, NSW 2010, Australia

yangpy@it.usyd.edu.au; wei.liu@unimelb.edu.au

Abstract. The wrapper feature selection approach is useful in identifying informative feature subsets from high-dimensional datasets. Typically, an inductive algorithm "wrapped" in a search algorithm is used to evaluate the merit of the selected features. However, significant bias may be introduced when dealing with highly imbalanced datasets. That is, the selected features may favour one class while being less useful to the other class. In this paper, we propose an ensemble-based wrapper approach for feature selection from data with highly imbalanced class distributions. The key idea is to create multiple balanced datasets from the original imbalanced dataset via sampling, and subsequently evaluate feature subsets using an ensemble of base classifiers, each trained on a balanced dataset. The proposed approach provides a unified framework that incorporates ensemble feature selection and multiple sampling in a mutually beneficial way. The experimental results indicate that, overall, features selected by the ensemble-based wrapper are significantly better than those selected by wrappers with a single inductive algorithm in imbalanced data classification.

1 Introduction

Feature selection is a critical procedure for high-dimensional data classification. The benefits of feature selection are several-fold and depend on the application. For creating classification models, feature selection can often improve predictive accuracy and comprehensibility [1].
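The ensemble evaluation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes majority-class undersampling to build the balanced datasets, a nearest-centroid base learner for self-containment (the paper wraps standard inductive algorithms), and balanced accuracy as the merit score; all function names here are hypothetical.

```python
# Hypothetical sketch: score a candidate feature subset with an ensemble of
# base classifiers, each trained on a balanced undersample of the data.
import numpy as np

def fit_centroids(Xs, ys):
    # class centroids of one (balanced) training sample
    return Xs[ys == 0].mean(axis=0), Xs[ys == 1].mean(axis=0)

def predict_centroids(c0, c1, Xs):
    # assign each point to the nearer class centroid
    d0 = np.linalg.norm(Xs - c0, axis=1)
    d1 = np.linalg.norm(Xs - c1, axis=1)
    return (d1 < d0).astype(int)

def ensemble_merit(X, y, feature_subset, n_estimators=5, seed=0):
    """Merit of `feature_subset`: balanced accuracy of a majority-vote
    ensemble whose members are trained on balanced undersamples."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)   # assume class 1 is the rare class
    majority = np.flatnonzero(y == 0)
    votes = np.zeros(len(y))
    for _ in range(n_estimators):
        # pair all minority examples with an equal-sized majority draw
        drawn = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, drawn])   # one balanced dataset
        c0, c1 = fit_centroids(X[np.ix_(idx, feature_subset)], y[idx])
        votes += predict_centroids(c0, c1, X[:, feature_subset])
    pred = (2 * votes >= n_estimators).astype(int)  # majority vote
    recall_min = float(np.mean(pred[minority] == 1))
    recall_maj = float(np.mean(pred[majority] == 0))
    return (recall_min + recall_maj) / 2            # balanced accuracy

# Toy 10:1 imbalanced data where only feature 0 carries signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(220, 3))
y = np.array([0] * 200 + [1] * 20)
X[y == 1, 0] += 4.0
good = ensemble_merit(X, y, [0])   # informative subset
poor = ensemble_merit(X, y, [1])   # pure-noise subset
```

A search algorithm would call a scoring function like `ensemble_merit` on each candidate subset; because every base classifier sees a balanced dataset, the score does not favour the majority class the way a single classifier trained on the raw imbalanced data would.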
For many bioinformatics applications, feature selection is a critical procedure for identifying important biomarkers [2]. Techniques for feature selection are commonly classified into the filter approach, the wrapper approach, and the embedded approach. The filter and embedded approaches are relatively computationally efficient and are commonly applied as fast feature ranking procedures [3]. In contrast, the wrapper approach evaluates features by performing internal classification with a given inductive algorithm [4], and is therefore much more computationally intensive. Nevertheless, the wrapper approach remains attractive for two reasons. Firstly, it evaluates features iteratively with respect to an inductive algorithm. Therefore, features