Towards a Possibilistic Processing of Missing Values Under Complex Conditions

Anas DAHABIAH, John PUENTES, and Basel SOLAIMAN
TELECOM Bretagne, Département Image et Traitement de l’Information, Brest, France
{anas.dahabiah, john.puentes, basel.solaiman}@telecom-bretagne.eu
http://www.telecom-bretagne.eu

Abstract: - To estimate the missing values of an attribute in the records of a dataset, all the information provided by the other attributes and the knowledge databases must be considered. However, the information elements could be imperfect (imprecise, possibilistic, probabilistic, etc.) and could have different measuring scales (quantitative, qualitative, ordinal, etc.) at the same time. Furthermore, the relationships and the correlation between the considered attribute and the others should also be taken into account. Unlike prior works, which have processed these issues separately using complex and conditional techniques, our approach, essentially based on the tools provided by possibility theory, can easily handle these aspects within a unified, robust, and simple framework. Several numeric examples and applications are given to illustrate the main steps of our method, and some promising perspectives are proposed at the end of this paper.

Key-Words: - Possibility Theory, Missing Data, Information Imperfection (Uncertainty and Ambiguity), Data Mining.

1 Introduction

Missing values are a thorny problem that continues to plague data mining and knowledge discovery, because most mining techniques and algorithms cannot be applied when attributes contain missing data. A common solution for handling missing values is simply to omit the attributes or fields with missing contents from the analysis. Nonetheless, this may be dangerous, since the pattern of missing values may be systematic, and simply deleting objects with missing values would lead to a biased subset of data [1].
Furthermore, it seems like a waste to omit the information in all the other fields just because one field value is missing [2]. Therefore, data analysts have turned to methods that replace the missing value with a value substituted according to various criteria.

So far, many methods have been developed to deal with missing data. These approaches have been classified into two main groups: pre-processing methods and embedded methods [2]. The former replace missing values before the data mining process, whereas the latter deal with them during the data mining process itself. For instance, the possibilistic similarity that we proposed in [3] can be seen, from a certain point of view, as an embedded method, because it does not require estimating the missing values when measuring the similarity between objects. Instead, it takes them into account during the computation when carrying out other tasks such as clustering, recognition, etc. [3][4]. Nevertheless, in many other applications, estimating the missing values can be indispensable and unavoidable. Accordingly, we propose in the following another approach that estimates the missing values in the pre-processing phase. Unlike the conventional methods, which are usually dedicated to one type of data measuring scale (qualitative, quantitative, binary, etc.) and neglect the imperfection in the information elements (imprecision, uncertainty, ambiguity, etc.), our approach takes all the aforementioned points into account in a unified framework, by applying a simple, fast, and flexible technique fundamentally based on possibility theory.

The next section briefly sums up some previous attempts and works in the domain. Section 3 stresses the deficiencies of these works, pointing out the need for, and the importance of, a more sophisticated approach that makes use of the monotone fuzzy measures of possibility theory, briefly presented in section 4.
Finally, section 5 presents our approach, which addresses this need, followed by two illustrative examples in sections 6 and 7, and some conclusions and remarks in section 8.

2 Prior Missing Data Methods

Many methods to deal with missing values have been proposed in the literature. These approaches can be classified into two main categories: pre-processing methods and embedded methods [1][2]. Pre-processing methods replace missing values before the data mining process, while embedded approaches deal with missing values during the data mining process itself.

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS Anas Dahabiah, John Puentes, Basel Solaiman ISSN: 1790-0832 562 Issue 4, Volume 7, April 2010
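The distinction between pre-processing and embedded methods can be illustrated with a minimal Python sketch. It is not the possibilistic method of this paper: column-mean imputation and a partial Euclidean distance are generic stand-ins chosen only to contrast the two families, and the toy records and the function names `impute_column_mean` and `partial_distance` are assumptions made for illustration.

```python
import math

# Toy records; None marks a missing value (hypothetical data, not from the paper).
records = [
    [1.0, 2.0],
    [2.0, None],
    [3.0, 6.0],
]

def impute_column_mean(rows):
    """Pre-processing stand-in: fill each missing value with its column mean.

    Returns a new list of rows; the input is left unchanged.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def partial_distance(a, b):
    """Embedded stand-in: Euclidean distance over the attributes observed in
    both records, rescaled by the fraction of attributes actually compared.
    No missing value is estimated.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # no attribute in common, nothing to compare
    squared = sum((x - y) ** 2 for x, y in pairs)
    return math.sqrt(squared * len(a) / len(pairs))

# Pre-processing: the dataset is completed first, then mined as usual.
completed = impute_column_mean(records)

# Embedded: the mining step (here a distance) copes with the gap directly.
d = partial_distance(records[0], records[1])
```

In the first case any downstream algorithm sees a complete table, at the cost of committing to an estimate; in the second case the gap is never filled, but every mining step must be written to tolerate it.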