Data Mining in Incomplete Numerical and Categorical Data Sets: a Neuro-Fuzzy Approach P. Rey-del-Castillo 1 , and J. Cardeñosa 2 Abstract - There are many applications dealing with incomplete data sets that take different approaches to making imputations for missing values. Most tackle the problem for numerical input variables in the data set. However, when there are two types of input variables, numerical and categorical, the state of the art has provided no clear solutions. This paper presents a proposal for handling incomplete numerical and categorical data sets using an extension of an existing neuro-fuzzy approach. The method is extensively tested in a real environment in the field of the political election polls. Keywords: Data mining; Fuzzy systems; Incomplete data sets ; Neural networks. 1 Introduction One of the tasks involved in a data mining process is classification. Classification can be defined as a procedure in which individual items are assigned to categories. One specific classification problem is when items in data sets contain categorical and numerical data. Because of the appeal of using simple rules that do not take a lot of effort to construct, fuzzy control systems have been used for the purpose of classification from the earliest days of fuzzy logic [1], [2]. These systems usually generate a rule for each classification category, specifying the rule’s antecedent from fuzzy sets defined over the input variables set. The rules are easy to specify when there are not many categories, but this gets harder as the number grows. To overcome this problem some hybrid approaches have been proposed to ease the learning of the fuzzy rules. These hybrid procedures are mainly based on the combination of fuzzy set theory with other methodologies, like evolutionary algorithms and neural networks. Neuro-fuzzy computation is one of the most popular hybridizations in the artificial intelligence literature because it combines the merits of the neural and fuzzy approaches. It has the generic benefits of neural networks –like massive parallelism and robustness– and, at the same time, uses fuzzy logic to model vague or qualitative knowledge and convey uncertainty [3]. The fuzzy min-max neural network classifier is a supervised learning method that takes a hybrid approach combining neural networks and fuzzy systems. Simpson’s original fuzzy min-max neural networks model [4], [5] was later modified and improved by Gabrys and Bargiela [6], [7]. The new version offers a new approach to dealing with incomplete data sets. Because of its good classification performance, there have also been further modifications, i. e. to the fuzzy membership definition [8] and several learning process steps [9], [10], [11]. A characteristic of the fuzzy min-max neural network classifier is that the input variables for learning and classification are numerical. This can be a very restrictive feature in many real-world applications. In this article we will present a method that extends the input to categorical variables by introducing new fuzzy sets, new procedures and a new architecture, allowing for greater flexibility and wider application. The method also straightforwardly extends the treatment of the missing input variable values. The new method has been tested on a real case of non-response imputation in an opinion poll. Missing information in data sets is a more than common scenario. A frequently used procedure to deal with this problem is to replace each missing variable value with an estimated value or imputation obtained from the values of other variables in the same unit. The microdata (the set of the respondents’ individual answers to the questions) in opinion polls are especially suited for evaluating the method, since they include a great many numerical and categorical attributes. The paper is organized as follows. Section 2 presents the architecture and operation of fuzzy min-max neural networks as a starting point for the new classifier. Section 3 shows the new fuzzy set-based method used to define new networks and their architecture and operation. Section 4 describes the context of the imputation problem to be solved using the new method and presents the experiment results. These are compared with the outcomes of traditional methods applied to the same data sets, resulting in some improvements as shown in the outlined experiment.