758 IEEE SENSORS JOURNAL, VOL. 14, NO. 3, MARCH 2014

Dealing With Redundant Features and Inconsistent Training Data in Electronic Nose: A Rough Set Based Approach

Anil Kumar Bag, Bipan Tudu, Nabarun Bhattacharyya, and Rajib Bandyopadhyay

Abstract: In many applications of the electronic nose, the instrument is trained with data generated by human experts prior to its deployment in the field. Quite often, these data are conflicting and inaccurate, and thus the performance of the electronic nose is degraded. Its performance may also be degraded by the presence of redundant features or sensors in the array. When an electronic nose is deployed for a specific application, it is observed that some of the sensors may not be required and only a subset of the sensor array contributes to the decision, which implies that optimization of the sensor array is also very important. To obtain a consistent and precise data set, both the conflicting data and the irrelevant features must be removed. Rough set theory is capable of dealing with such imprecise, inconsistent data sets. In this paper, a rough set based algorithm is applied to remove the conflicting training patterns and to optimize the sensor array in an electronic nose instrument used for sensing the aroma of black tea samples.

Index Terms: Black tea, electronic nose, feature selection, reduct, rough set, sensor array.

I. INTRODUCTION

An electronic nose [1] comprises a sensor array with its associated electronic circuitry, an odour delivery system, and a pattern recognition unit. The knowledge base in the pattern recognition unit consists of feature information obtained from the electrical or other types of sensory responses produced by the sensors in the array. These responses, expressed as numerical data patterns, contain the signature that is related to some features of the exposed substance.
The sensors in any application-specific instrument, such as an electronic nose or electronic tongue, should be sufficiently reliable, robust, selective, and reversible to guarantee satisfactory classification. Unfortunately, limitations in data collection, the high complexity of sensory inputs, transient effects, and equipment failure make it difficult to train the classifier with data that represent the desired characteristics in a statistically sufficient way.

Manuscript received June 6, 2013; revised September 25, 2013; accepted October 8, 2013. Date of publication October 17, 2013; date of current version January 10, 2014. This work was supported in part by the Department of Science and Technology and in part by the National Tea Research Foundation, Tea Board, Government of India. The associate editor coordinating the review of this paper and approving it for publication was Dr. Ashish Pandharipande.
A. K. Bag is with the Department of Applied Electronics and Instrumentation Engineering, Future Institute of Engineering and Management, Calcutta 700150, India (e-mail: anilkumarbag@gmail.com).
B. Tudu and R. Bandyopadhyay are with the Department of Instrumentation and Electronics Engineering, Jadavpur University, Calcutta 700098, India (e-mail: bt@iee.jusl.ac.in; rb@iee.jusl.ac.in).
N. Bhattacharyya is with the Centre for Development of Advanced Computing, Calcutta 700091, India (e-mail: nabarun.bhattacharya@cdac.in).
Digital Object Identifier 10.1109/JSEN.2013.2286110

On many occasions, there may be some redundant features in the training patterns. In addition, an electronic nose deployed for a particular application is trained with data supplied by human experts, and these patterns often contain conflicting data due to human error. For example, when the electronic nose is used for the evaluation of tea quality, the quality scores assigned by the human tea tasters are the target patterns in the training data set.
The quality scores are purely subjective in nature and depend upon the mood and professional acumen of the tea taster. Thus, the training data patterns may contain a number of irrelevant or redundant features and some decision-conflicting data patterns, which make the representation of the information inconsistent. Such a data set not only increases time complexity but also degrades classification accuracy. As the effective information for classification often lies within a lower-dimensional feature space, feature extraction or dimensionality reduction has proven to be a crucial step in all analytical methods or applications [2], [3]. The aim of this work is to develop a strategy based on rough set theory that addresses both the discovery of relevant features or attributes and the filtering of conflicting data.

Rough set theory (RST) [4]–[6] was proposed by Z. Pawlak in the early 1980s and has since received growing attention in the domain of artificial intelligence and cognitive sciences, especially in the spheres of machine learning, knowledge acquisition, knowledge discovery from databases, decision analysis, expert systems, inductive reasoning, data mining [7], [8], and pattern recognition. It also enables the creation of classification rules from large data sets and has been applied successfully in different fields such as medical diagnosis [9], [10], stock market prediction [11], and insurance market analysis [12]. In addition, rough set based feature selection has been used on a QSAR (quantitative structure-activity relationship) data set with a support vector machine as the classifier [13]. Another important feature of RST is attribute reduction [14]–[17]. The idea behind rough set theory is the observation that uncertainty and imprecision in the knowledge base lead to vague decisions, and that this vagueness may be caused by the granularity with which the information is represented.
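The notions of decision-conflicting patterns and attribute reduction discussed above can be illustrated with a minimal sketch. It is not the algorithm of this paper; it only assumes that sensor responses have already been discretized into symbolic levels. The sensor names (`s1`, `s2`, `s3`), the levels, the quality labels, and all function names are hypothetical. Patterns that share the same feature values are grouped together, groups carrying more than one decision are flagged as conflicting, and the standard rough set dependency degree (the fraction of patterns in decision-consistent groups) is used to check whether a smaller sensor subset classifies as well as the full array.

```python
from collections import defaultdict


def indiscernibility_classes(rows, attrs):
    """Group row indices by their values on the chosen condition attributes."""
    classes = defaultdict(list)
    for i, (cond, _decision) in enumerate(rows):
        classes[tuple(cond[a] for a in attrs)].append(i)
    return list(classes.values())


def conflicting_rows(rows, attrs):
    """Return indices of rows whose class carries more than one decision."""
    conflicts = []
    for cls in indiscernibility_classes(rows, attrs):
        if len({rows[i][1] for i in cls}) > 1:
            conflicts.extend(cls)
    return sorted(conflicts)


def dependency_degree(rows, attrs):
    """Fraction of rows lying in decision-consistent classes."""
    return 1 - len(conflicting_rows(rows, attrs)) / len(rows)


# Hypothetical toy data: (discretized sensor responses, taster's label).
rows = [
    ({"s1": "high", "s2": "low", "s3": "low"}, "good"),
    ({"s1": "high", "s2": "low", "s3": "low"}, "poor"),  # conflicts with row 0
    ({"s1": "low", "s2": "high", "s3": "low"}, "poor"),
    ({"s1": "low", "s2": "low", "s3": "low"}, "good"),
    ({"s1": "high", "s2": "high", "s3": "low"}, "poor"),
]
all_attrs = ["s1", "s2", "s3"]

# Assumption: every row of a conflicting class is dropped before reduction.
bad = set(conflicting_rows(rows, all_attrs))
filtered = [r for i, r in enumerate(rows) if i not in bad]

for subset in (all_attrs, ["s1"], ["s2"]):
    print(subset, dependency_degree(filtered, subset))
```

In this toy table, rows 0 and 1 have identical responses but different labels, so both are flagged. On the filtered table, `s2` alone keeps the dependency degree at 1.0 while `s1` alone does not, which is the sense in which a sensor subset can be sufficient. Dropping every row of a conflicting class is only one possible filtering policy.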
Knowledge representation in rough set theory is carried out via an information system, a table that expresses the OBJECT → ATTRIBUTE VALUE relationship. A tight granularity of representation in the information system requires that each equivalence class contain only similar objects, which leads to a consistent rule base. Thus, it is important to filter data in the knowledge base in order to

1530-437X © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.