Learning and Nonlinear Models - Revista da Sociedade Brasileira de Redes Neurais (SBRN), Vol. 6, No. 1, pp. 29-43, 2008
©Sociedade Brasileira de Redes Neurais

THE PROPOSAL OF TWO BIO-INSPIRED ALGORITHMS FOR TEXT CLUSTERING

Ana Karina F. Prior 1,2
Leandro Nunes de Castro 1,2
Leandro R. de Freitas 1
Alexandre Szabo 2

1 NatComp - From Nature to Business, Rua do Comércio 44, sala 03, Centro, Santos - SP - Brazil.
2 Mackenzie University, Rua da Consolação, 896, São Paulo - SP - Brazil.

Emails: [anakarina.prior, leandrorubim, alexandreszabo]@gmail.com, lnunes@mackenzie.br

Abstract

The Internet can be seen as a major repository of resources and information. The growing demand for information, along with the large amount of data available, has stimulated research into methods for text mining. This work uses feature selection and text clustering techniques based on a Particle Swarm Clustering (PSC) algorithm and on an Artificial Neural Network modeled as a competitive and constructive Antibody Network, called RABNET (Real-valued Antibody Network), to show that both techniques present relevant results when applied to text clustering problems.

Keywords: Text Mining, Text Clustering, PSC, RABNET, Artificial Immune Systems, Swarm Intelligence.

1 Introduction

Text mining (Weiss et al., 2005; Hotho et al., 2005) is a field of research within the data mining (Han et al., 2000) area that has received a great deal of attention in recent years. This is mainly due to the growing need to analyze text data automatically, including data available on the Internet, since the overload of text information hinders its manual analysis, location and access. Text mining can be understood as the process of extracting interesting, non-trivial and useful patterns (knowledge) from text documents (Weiss et al., 2005). Several important classes of problems can be solved using text mining, such as document classification, clustering, and information retrieval.
The main objective of this work is to apply both an artificial immune network algorithm, named RABNET (Real-valued Antibody Network) (Knidel et al., 2005), and a swarm intelligence algorithm, named PSC (Particle Swarm Clustering) (Cohen & Castro, 2006), to the clustering of text data. To assess their performance, both algorithms were implemented and applied to cluster three benchmark text corpora. The performance measures used to evaluate and compare the algorithms were Entropy and Purity (Zhao & Karypis, 2004), and the algorithm used for benchmarking was the well-known k-means clustering (Chakrabarti, 2003).

This paper is organized as follows. Section 2 provides a brief overview of bio-inspired computing and Section 3 briefly describes related works. Section 4 introduces the PSC algorithm and Section 5 introduces the RABNET algorithm. Section 6 provides a brief description of the similarity measure used. Section 7 describes the datasets used for assessment, discusses the sensitivity of the algorithms to their tunable parameters, and presents the experiments performed, followed by a discussion of the results obtained. The paper is concluded in Section 8.

2 Bio-Inspired Computing

For years, many computational models studied by scientists and engineers brought huge benefits and solutions to humanity. Despite these technological advances, solutions to well-known and complex problems, such as autonomous navigation and knowledge discovery from large databases, were still scarce or inadequate. Many problems remained unsolved or poorly solved, opening the field for research into innovative solutions.

Bio-inspired Computing is one of the three categories of the broad field of Natural Computing (de Castro, 2006), which emerged from the idea of using nature's inspiration to search for new computational solutions to complex problems.
It is based on living organisms and on their processes and behaviors, such as self-organization and mechanisms of survival and adaptation, which evolved over thousands of years. Inspired by nature, researchers see the possibility of creating computational models based on biological processes and phenomena. Natural computing thus involves the extraction of ideas from nature for the design of nature-inspired algorithms for solving complex problems. This field of research can be divided into three categories (de Castro, 2007):

1. Computing inspired by nature: using nature as inspiration for the development of techniques (tools) for solving complex problems. This branch involves techniques such as Neural Networks, Evolutionary Computing, Swarm Intelligence and Artificial Immune Systems.