Learning and Nonlinear Models - Revista da Sociedade Brasileira de Redes Neurais (SBRN), Vol. 6, No. 1, pp. 29-43, 2008
©Sociedade Brasileira de Redes Neurais
THE PROPOSAL OF TWO BIO-INSPIRED ALGORITHMS
FOR TEXT CLUSTERING
Ana Karina F. Prior (1,2), Leandro Nunes de Castro (1,2), Leandro R. de Freitas (1), Alexandre Szabo (2)
(1) NatComp - From Nature to Business, Rua do Comércio 44, sala 03, Centro, Santos - SP – Brazil.
(2) Mackenzie University, Rua da Consolação, 896, São Paulo – SP - Brazil.
Emails: [anakarina.prior, leandrorubim, alexandreszabo]@gmail.com, lnunes@mackenzie.br
Abstract – The Internet can be seen as a major repository of resources and information. The growing demand for
information, along with the large amount of data available, has been stimulating research on methods for text mining. This
work aims at using feature selection and text clustering techniques based on a Particle Swarm Clustering (PSC) algorithm and
on an Artificial Neural Network modeled as a competitive and constructive Antibody Network, called RABNET (Real-valued
Antibody Network), to show that both techniques present relevant results when applied to text clustering problems.
Keywords: Text Mining, Text Clustering, PSC, RABNET, Artificial Immune Systems, Swarm Intelligence.
1 Introduction
Text mining (Weiss et al., 2005; Hotho et al., 2005) is a field of research within the data mining (Han et al., 2000) area that
has been receiving a great deal of attention over the past years. This is mainly due to the growing need to automatically analyze
text data, including data available on the Internet, since the overload of text information hinders its manual analysis, location
and access. Text mining can be understood as the process of extracting interesting, non-trivial and useful patterns (knowledge)
from text documents (Weiss et al., 2005). Several important classes of problems can be solved using text mining, such as
document classification, clustering, and information retrieval.
The main objective of this work is to apply both an artificial immune network algorithm, named RABNET (Real-valued
Antibody Network) (Knidel et al., 2005), and a swarm intelligence algorithm, named PSC (Particle Swarm Clustering) (Cohen
& Castro, 2006), to the clustering of text data. In order to assess the performance of both algorithms, they
were implemented and applied to cluster three benchmark text corpora. The performance measures used to evaluate and
compare the algorithms were Entropy and Purity (Zhao & Karypis, 2004), and the algorithm used for benchmarking was the
well-known k-means clustering algorithm (Chakrabarti, 2003).
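As an illustration of the two evaluation measures, the following minimal sketch computes Purity and Entropy for a clustering result. It assumes the commonly used definitions, in which the purity of a cluster is the fraction of its documents belonging to its dominant class, the entropy of a cluster is the class-distribution entropy normalized by the logarithm of the number of classes, and both are averaged over clusters weighted by cluster size. The function names `cluster_counts`, `purity` and `entropy` are hypothetical and are not taken from the original work; the exact formulation used in the experiments may differ in detail.

```python
import math
from collections import Counter


def cluster_counts(labels, clusters):
    """Count, for each cluster id, how many documents of each true class it holds."""
    groups = {}
    for label, cluster in zip(labels, clusters):
        groups.setdefault(cluster, Counter())[label] += 1
    return groups


def purity(labels, clusters):
    """Size-weighted average of the dominant-class fraction of each cluster."""
    groups = cluster_counts(labels, clusters)
    n = len(labels)
    # Summing the dominant-class counts and dividing by n is equivalent to
    # weighting each cluster's purity by its relative size.
    return sum(max(counts.values()) for counts in groups.values()) / n


def entropy(labels, clusters):
    """Size-weighted average of the normalized class entropy of each cluster."""
    groups = cluster_counts(labels, clusters)
    n = len(labels)
    q = len(set(labels))  # number of true classes (assumed normalization term)
    if q < 2:
        return 0.0  # a single class carries no uncertainty
    total = 0.0
    for counts in groups.values():
        n_j = sum(counts.values())
        e_j = -sum((c / n_j) * math.log(c / n_j) for c in counts.values())
        total += (n_j / n) * (e_j / math.log(q))
    return total


if __name__ == "__main__":
    # Toy example: four documents, two true classes, two clusters.
    true_classes = ["sports", "sports", "politics", "politics"]
    assigned = [0, 0, 1, 1]
    print(purity(true_classes, assigned), entropy(true_classes, assigned))
```

Under these definitions, higher Purity and lower Entropy indicate clusters that agree more closely with the known classes of the corpus.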
This paper is organized as follows. Section 2 provides a brief overview of bio-inspired computing and Section 3 briefly
describes related work. Section 4 introduces the PSC algorithm and Section 5 introduces the RABNET algorithm. Section 6
provides a brief description of the similarity measure used. Section 7 provides information about the datasets used for
assessment, a discussion about the sensitivity of the algorithms to their tunable parameters, the experiments performed, followed
by a discussion of the results obtained. The paper is concluded in Section 8.
2 Bio-Inspired Computing
For years, many computational models studied by scientists and engineers have brought substantial benefits and solutions to humanity.
Despite these technological advances, solutions to well-known and complex problems, such as autonomous navigation and
knowledge discovery from large databases, were still scarce or inadequate. Many problems remained unsolved or poorly
solved, opening the field to research into innovative solutions. Bio-inspired Computing is one of the three categories of the
broad field of Natural Computing (de Castro, 2006), which emerged from the idea of using nature as inspiration in the search for
new computational solutions to complex problems. It is based on living organisms and on their processes and behaviors evolved over
thousands of years, such as self-organization and mechanisms of survival and adaptation. Inspired by nature, researchers see the
possibility of creating computational models based on biological processes and phenomena. Natural computing thus involves
the extraction of ideas from nature for the design of nature-inspired algorithms for solving complex problems. This field of
research can be divided into three categories (de Castro, 2007):
1. Computing inspired by nature: using nature as inspiration for the development of techniques (tools) for solving
complex problems. This branch involves techniques such as Neural Networks, Evolutionary Computing, Swarm
Intelligence and Artificial Immune Systems.