KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW

Ana Azevedo
CEISE ISCAP IPP
Rua Jaime Lopes de Amorim, s/n
4465 S. M. de Infesta - Portugal

Manuel Filipe Santos
DSI - UM
Campus de Azurém
4800-058 Guimarães

ABSTRACT

In recent years the Data Mining field has grown and consolidated considerably. Efforts are under way to establish standards in the area, among them SEMMA and CRISP-DM. Both are emerging as industrial standards, and each defines a set of sequential steps intended to guide the implementation of data mining applications. This raises the question of whether they differ substantially from the traditional KDD process. In this paper we aim to establish a parallel between SEMMA, CRISP-DM and the KDD process, and to clarify the similarities between them.

KEYWORDS

Data Mining Standards, Knowledge Discovery in Databases, Data Mining.

1. INTRODUCTION

Fayyad considers Data Mining (DM) to be one of the phases of the KDD process (Fayyad et al., 1996). The DM phase mainly concerns the means by which patterns are extracted and enumerated from data. The literature is a source of some confusion because the two terms are used interchangeably, which makes it difficult to delimit each concept exactly (Benoît, 2002).

The growing attention paid to the area stems from the rise of large databases in an increasing and increasingly diverse number of organizations. The value and wealth of information contained in these databases risks being wasted unless adequate techniques are used to extract useful knowledge (Chen et al., 1996; Simoudis, 1996; Fayyad, 1996).

Efforts to establish standards in the area are being made both by academics and by people in industry. The academic efforts center on the attempt to formulate a general framework for DM (Dzeroski, 2006).
The bulk of these efforts is centered on the definition of a language for DM that could be accepted as a standard, in the same way that SQL was accepted as a standard for relational databases (Han et al., 1996; Meo et al., 1998; Imielinski et al., 1999; Sarawagi, 2000; Botta et al., 2004). The efforts in industry mainly concern the definition of processes and methodologies that can guide the implementation of DM applications.

In this paper, SEMMA and CRISP-DM have been chosen because they are considered to be the most popular. This perception is not scientifically established, but it exists because SEMMA and CRISP-DM are presented in many publications in the area and are actually used in practice. During the analysis of the documentation on SEMMA and on CRISP-DM, the question arose of whether they differ substantially from the traditional KDD process. This paper aims to establish a parallel between them and the KDD process, and to clarify the similarities between them.

The paper begins, in Section 2, by presenting KDD, SEMMA and CRISP-DM. Next, in Section 3, a comparative study presents the analogies and differences between the three processes. Finally, Section 4 presents conclusions and future work.

ISBN: 978-972-8924-63-8 © 2008 IADIS