Anais do XIX Congresso Brasileiro de Automática, CBA 2012. ISBN: 978-85-8001-069-5

PERFORMANCE ESTIMATION FOR CLUSTERING ALGORITHMS WITH META-LEARNING TECHNIQUES

DANIEL G. FERRARI, LEANDRO N. DE CASTRO

LCON – NATURAL COMPUTING LABORATORY
MACKENZIE UNIVERSITY
SAO PAULO, SP, BRAZIL
Emails: ferrari.dg@gmail.com, lnunes@mackenzie.br

Abstract An important open problem in machine learning is the choice of which algorithm to apply to each problem at hand. This holds for many data mining tasks, such as clustering and classification. A given problem may or may not have a number and type of attributes that favor one algorithm over another. Meta-learning is a research approach that aims to extract features from the attributes of datasets and to use these features to assess the influence of the attributes on the performance of algorithms. As a result, the user is spared the burden of trying and evaluating every algorithm on each new problem. Meta-learning has been broadly investigated in the context of classification, but few works address meta-learning for clustering tasks. The present paper therefore focuses on the use of meta-learning for the performance estimation of clustering algorithms, based on features extracted from unlabelled objects. The features of the clustering problems are evaluated together with the performance of different algorithms, so that the meta-learning system can estimate the performance of the algorithms on a new problem.

Keywords clustering, meta-learning, performance estimation.

1 Introduction

A huge amount of information is currently represented and stored as data for later analysis (Xu & Wunsch, 2005). Researchers have dedicated themselves to developing methods to extract knowledge from data; the process of applying these methods is known as data mining (Fayyad et al., 1996).
Nowadays, data mining tools offer a variety of algorithms for each of the many data mining tasks. However, the process suffers from a lack of guidelines for selecting the best algorithm to solve a given problem (Brazdil et al., 2009). The meta-learning field of research aims to find which problem features contribute to a better or worse performance of an algorithm (Giraud-Carrier et al., 2004) and, from this, to recommend the most appropriate algorithm for solving a given problem (Brazdil et al., 2009). To reach this objective, meta-learning builds two key sets:

1) Meta-attributes: the set of features that are common to several instances of a class of problems, such as the number of objects and the number of binary attributes, among others;
2) Performance: the set of performances of several algorithms applied to the same problems.

From these sets, a model is built to predict, based on the proposed meta-attributes, the performance of the algorithms on other problems that were not used for training.

The connection between data mining and meta-learning has been widely investigated for classification tasks (Brazdil et al., 2009; Smith-Miles, 2009; Michie et al., 1994). However, few studies are available in the literature for clustering tasks (de Souto et al., 2008; Soares et al., 2009; Nascimento et al., 2009). For instance, no study has established which feature set is best for unsupervised learning problems such as clustering (Brazdil et al., 2009).

The experiments performed in this work investigate the performance estimation of a number of algorithms for clustering problems, based on meta-attributes described in the literature for classification problems. Even so, the features selected here do not require data labels, which makes the methodology generic for clustering tasks.

This work is organized as follows.
Section 2 presents a brief theoretical background on meta-learning and clustering. Section 3 explains the methodology used in the experiments and shows the results. Section 4 concludes the paper with a discussion of the results and of future investigations.

2 Theoretical Background

2.1 Meta-Learning

Meta-learning is learning about learning, i.e., one must learn about the behavior of machine learning algorithms in order to find the best algorithm for each problem (Aha, 1992). In 1994, the EU ESPRIT project StatLog (Michie et al., 1994) extended this concept with the objective of relating the performance of an algorithm to the features of the objects in classification problems. Meta-learning is tightly connected with the process of extracting and exploiting meta-knowledge, which can assume different forms and can be defined as any kind of knowledge that can be extracted from the learning process of an algorithm whilst it is being applied to a problem (Giraud-Carrier et al., 2004).
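
To make the two key sets of meta-learning concrete (meta-attributes and performance), the following minimal Python sketch extracts three label-free meta-attributes from a dataset and estimates algorithm performance on a new problem from the most similar known problems. It is an illustration under stated assumptions, not the method evaluated in this paper: the nearest-neighbour estimator, the function names, the algorithm names, and the performance values are all hypothetical.

```python
from math import sqrt
from statistics import mean


def extract_meta_attributes(dataset):
    """Describe an unlabelled dataset given as a list of equal-length numeric rows.

    Returns [number of objects, number of attributes, number of binary attributes].
    No class labels are needed for any of these features.
    """
    columns = list(zip(*dataset))
    n_binary = sum(1 for column in columns if set(column) <= {0, 1})
    return [float(len(dataset)), float(len(columns)), float(n_binary)]


def estimate_performance(meta_base, new_meta, k=1):
    """Average the recorded performances of the k most similar known problems.

    meta_base is a list of (meta_attributes, {algorithm_name: performance}) pairs.
    Assumption: similarity is plain Euclidean distance between meta-attribute vectors.
    """
    def distance(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = sorted(meta_base, key=lambda entry: distance(entry[0], new_meta))[:k]
    algorithms = nearest[0][1]
    return {name: mean(perf[name] for _, perf in nearest) for name in algorithms}


# Hypothetical meta-base: algorithm names and scores are illustrative only.
meta_base = [
    ([3.0, 3.0, 2.0], {"algorithm_a": 0.8, "algorithm_b": 0.6}),
    ([100.0, 10.0, 0.0], {"algorithm_a": 0.5, "algorithm_b": 0.7}),
]

# A new, unlabelled problem with two binary attributes and one continuous one.
new_problem = [[0, 1.5, 1], [1, 2.5, 0], [0, 3.5, 1], [1, 4.5, 1]]
estimate = estimate_performance(meta_base, extract_meta_attributes(new_problem), k=1)
```

In practice the meta-attributes would usually be normalized before distances are computed, since raw counts such as the number of objects would otherwise dominate the comparison.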