Evolving Ensembles of Feature Subsets towards Optimal Feature Selection for Unsupervised and Semi-supervised Clustering

Mihaela Elena Breaban
Faculty of Computer Science, Al. I. Cuza University, Iasi, Romania
pmihaela@infoiasi.ro

Abstract. The work in unsupervised learning centered on clustering has been extended with new paradigms to address the demands raised by real-world problems. In this regard, unsupervised feature selection has been proposed to remove noisy attributes that could mislead the clustering procedures. Additionally, semi-supervision has been integrated within existing paradigms because some background information usually exists in the form of a small number of similarity/dissimilarity constraints. In this context, the current paper investigates a method to perform feature selection and clustering simultaneously. The benefits of a semi-supervised approach that uses limited external information are highlighted in comparison with an unsupervised approach. The method uses an ensemble of near-optimal feature subsets delivered by a multi-modal genetic algorithm to quantify the relative importance of each feature to clustering.

Key words: unsupervised and semi-supervised learning, clustering, feature selection, feature ranking, ensemble learning

1 Introduction

Classification and clustering are two intensively studied problems in machine learning. Classification aims at assigning new data items to existing groupings. Clustering is the problem of identifying natural or interesting groupings in data. Although similar at first sight, they belong to two distinct paradigms: supervised versus unsupervised learning. Feature selection (FS) is a problem of great interest for both scenarios, classification and clustering, with the aim of improving the performance of the corresponding machine learning techniques. Feature ranking is a relaxation of FS: the features are ranked based on their relevance to the problem under investigation.
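The idea of deriving a feature ranking from an ensemble of feature subsets, as outlined in the abstract, can be illustrated with a minimal sketch. It assumes that each near-optimal subset is encoded as a binary mask over the features and that relevance is measured simply by how often a feature is selected across the ensemble. This frequency-based aggregation is an illustrative assumption and not necessarily the weighting used by the method; the function names `feature_scores` and `rank_features` are hypothetical.

```python
from typing import List, Sequence


def feature_scores(ensemble: Sequence[Sequence[int]]) -> List[float]:
    """Return, for each feature, the fraction of subsets that select it.

    Each element of `ensemble` is a binary mask (1 = feature selected).
    Frequency of selection is an assumed, illustrative relevance measure.
    """
    if not ensemble:
        raise ValueError("ensemble must contain at least one feature subset")
    n_features = len(ensemble[0])
    if any(len(mask) != n_features for mask in ensemble):
        raise ValueError("all masks must have the same length")
    return [
        sum(mask[j] for mask in ensemble) / len(ensemble)
        for j in range(n_features)
    ]


def rank_features(ensemble: Sequence[Sequence[int]]) -> List[int]:
    """Return feature indices sorted from most to least frequently selected.

    Ties are broken by the lower feature index.
    """
    scores = feature_scores(ensemble)
    return sorted(range(len(scores)), key=lambda j: (-scores[j], j))


# Three hypothetical near-optimal subsets over four features.
example_ensemble = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]
example_ranking = rank_features(example_ensemble)
```

In this toy ensemble, feature 0 appears in every subset and feature 3 in none, so they fall at opposite ends of the ranking; a selection threshold on the scores would turn the ranking back into a feature subset.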
With regard to clustering, fewer approaches exist in the literature than for classification, owing to the difficulties raised by the unsupervised nature of the problem; most of them offer feature rankings because the optimal number of features to select is hard to determine in the unsupervised scenario. The current work investigates an extension of a feature ranking technique we have recently proposed in the context of unsupervised clustering [1]. The method