Feature Selection with Controlled Redundancy in a Fuzzy Rule Based Framework

I-Fang Chung, Member, IEEE, Yi-Cheng Chen, and Nikhil R. Pal, Fellow, IEEE

TFS-2016-0830

(c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TFUZZ.2017.2688358, IEEE Transactions on Fuzzy Systems.

Abstract—Features that have good predictive power for classes or output variables are useful, and hence most feature selection methods try to find them. However, since there may be high correlation or nonlinear dependency among such good features, comparable performance may be obtained even when only a few of them are used. A feature selection method should therefore select useful features with controlled redundancy. In this paper, we propose a novel learning method that imposes a penalty on the use of dependent/correlated features during system identification along with feature selection. This feature selection scheme can choose good features, discard indifferent and derogatory features, and control the level of redundancy in the set of selected features. To our knowledge, this is the first attempt at feature selection with redundancy control in a fuzzy rule based framework. We demonstrate the effectiveness of the method using a 10-fold cross-validation setup on a synthetic data set as well as on several commonly used classification data sets, and we compare our results with some state-of-the-art methods.

Keywords: useful features, feature dependency, penalty, feature selection with controlled redundancy, fuzzy rule based system.

I. INTRODUCTION
The present era has been witnessing a rapid growth of data in different forms such as protein sequences, gene sequences, gene expression profiles, satellite images, online transactions, surveillance videos, functional MRIs, and many others. These data pose new problems because of their size, heterogeneity, and time-varying characteristics. Consequently, they demand specialized modeling and analysis tools, and machine learning and computational intelligence tools have been extensively used for analyzing such data. Usually such data sets have many indifferent, correlated/dependent, and derogatory features. More features demand more data acquisition time and cost, and more complex design mechanisms; additional features also result in more design and decision-making time. Hence, reducing the dimensionality through feature selection, whenever possible, is always desirable. Although the literature is quite rich with feature selection methods, there is a strong need for new and improved methods that can deal with today's very complex and high-dimensional data [1] and select useful features in an efficient manner. In [2], the concept of a feature-attenuating gate was proposed for feature selection using a multilayer perceptron neural network.

Manuscript received August 30, 2016; revised November 14, 2016; accepted March 5, 2017. The first author was partially supported by the Ministry of Science and Technology, Taiwan, under Grant MOST 105-2628-E-010-002-MY2. I. F. Chung is with the Institute of Biomedical Informatics, and also with the Center for Systems and Synthetic Biology, National Yang-Ming University, Taipei 112, Taiwan (e-mail: ifchung@ym.edu.tw). Y. C. Chen is with the Software Product Center, Wistron Corporation, New Taipei City, Taiwan (e-mail: weasley001@gmail.com). N. R. Pal is with the Electronics and Communication Sciences Unit, Indian Statistical Institute, Calcutta 700108, India (e-mail: nikhil@isical.ac.in).
In our recent studies, we have developed integrated schemes for feature selection and fuzzy rule extraction for classification as well as function approximation/prediction problems using the concept of gates [3]-[5]. Although these methods can select useful features and discard bad and indifferent ones, they cannot control the use of redundant features. To our knowledge, there has been no study that selects useful features with a control on redundancy in an integrated manner while designing a fuzzy rule based system. Here we eliminate this shortcoming: we propose a new learning scheme that uses a regularizing term to impose a penalty on the use of redundant features. The scheme can select useful features with controlled redundancy and can discard derogatory and indifferent features. Consequently, it can greatly reduce dimensionality and save cost, time, and effort.

Generally, feature selection methods fall into three broad groups: filter, wrapper, and embedded methods [6], [7]. A filter method ranks features independently of the classifier (or prediction system) that will use them; such a method prioritizes features with high relevance (or dependency) to the target output. A wrapper method usually performs a heuristic search in the feature space and selects a subset of features by evaluating its effectiveness according to the predictive performance of the associated prediction system. An embedded method, on the other hand, assigns weights representing the importance of features using the internal parameters of the prediction system; this is an integrated approach in which the feature weighting/selection and the design of the prediction model are done simultaneously. Since wrapper and embedded methods can capture nonlinear interactions among features as well as between the features and the learning tool, these two classes of methods are more likely to exhibit better predictive performance.
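The core idea of penalizing redundant features during learning can be sketched numerically. The snippet below is a minimal, hypothetical illustration only: the gate vector, the use of absolute Pearson correlation, and the quadratic penalty form are assumptions made for exposition, not the exact formulation of this paper.

```python
import numpy as np

def redundancy_penalty(X, gates, lam=0.1):
    """Correlation-based redundancy penalty for gated features (illustrative).

    If two features with high absolute correlation both have open gates,
    the penalty is large, pushing the learner to close one of the gates.
    """
    R = np.abs(np.corrcoef(X, rowvar=False))  # |Pearson correlation| matrix
    np.fill_diagonal(R, 0.0)                  # ignore self-correlation
    g = np.asarray(gates, dtype=float)
    return lam * (g @ R @ g) / 2.0            # sum over i<j of g_i * g_j * |r_ij|

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)         # nearly a copy of x1 (redundant)
x3 = rng.normal(size=200)                     # independent feature
X = np.column_stack([x1, x2, x3])

both_open = redundancy_penalty(X, [1.0, 1.0, 1.0])
one_closed = redundancy_penalty(X, [1.0, 0.0, 1.0])
print(both_open > one_closed)                 # True: closing a redundant gate lowers the penalty
```

Added to the usual prediction-error loss, such a term would make an embedded learner trade off accuracy against redundancy, which is the role the regularizing term plays in the proposed scheme.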
Filter methods, however, usually have lower computational complexity; and of the remaining two, embedded methods usually have much lower computational complexity than wrapper methods because, unlike wrapper methods, they do not need to explicitly evaluate different subsets of features.

When addressing classification or regression problems, most feature selection methods do not explicitly try to avoid selecting redundant good features, because such informative features have good predictive power for the classes or output variables. However, since these good but redundant features have high correlations or nonlinear dependencies with other useful features [8], comparable performance for classification or regression tasks can be obtained more efficiently by choosing only a few of those good features as the system input. For example, a microarray gene expression data set with a large
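The claim that a few of several highly correlated good features can suffice is easy to verify on synthetic data. The following sketch uses an illustrative least-squares model (not the fuzzy rule based system of this paper): one feature is a near-copy of another, and dropping it barely changes the fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)        # redundant: nearly identical to x1
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

def lstsq_rmse(X, y):
    """Least-squares fit and root-mean-square training error."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sqrt(np.mean((X @ coef - y) ** 2))

rmse_both = lstsq_rmse(np.column_stack([x1, x2]), y)   # use both features
rmse_one = lstsq_rmse(x1[:, None], y)                  # use x1 alone
print(abs(rmse_both - rmse_one) < 0.01)                # comparable fit, half the features
```

This is exactly the situation that motivates controlled redundancy: a purely relevance-based ranking would keep both x1 and x2, while a redundancy-aware method can safely discard one of them.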