An Evolutionary Algorithm to Optimize Web Document Retrieval

André L. Vizine¹, Leandro N. de Castro¹ & Ricardo R. Gudwin²
{vizine,lnunes}@unisantos.br; gudwin@dca.fee.unicamp.br
¹ Catholic University of Santos, R. Dr. Carvalho de Mendonça, 144, 11070-906, Santos/SP – Brazil
² State University of Campinas, C.P. 6101, 13083-852, Campinas/SP – Brazil

Abstract – This paper presents an automatic keyword extraction method and an evolutionary algorithm that mines the Web for documents matching the interests of a group of users. Both techniques were designed for future use in an academic virtual community, characterized as a scientific paper collection (PDF files) and a means for efficient knowledge and information exchange through the Web. The preliminary results presented here indicate that the parts of the system already implemented have good potential for selecting appropriate libraries of keywords and, from them, building and optimizing queries for retrieving related documents from the Web.

1. INTRODUCTION

The Internet can be seen as a global and distributed repository of resources and information. In most cases, these resources are immediately available for use and cover almost all domains, from the support of scientific and educational activities to recreation and entertainment. According to a survey by the UCLA Center for Communication Policy, the three main reasons that lead new users to use, or want to use, the Internet are: to obtain and retrieve information quickly; professional needs; and communication (e.g., e-mail access) [1]. Moreover, the Internet is reducing the costs of producing and distributing information. As a result, an avalanche of material, in many cases of poor quality, is made available daily on the Web. Despite these benefits, the Internet is not adequately prepared for more abstract activities, such as the management, representation, and other types of information processing and exchange [2].

Along with the amount of information available, the number of people connected to the Internet and the number of web pages accessed have also increased exponentially over the past years. There is a great variety of resources and information available on the Web for people with the most diverse backgrounds and interests. The major problems of the Web, however, are the worldwide dispersion of the available bibliographical works, the speed with which this information is created and made available, and the poor quality of part of it. It is thus the reader's job to search for and filter out the relevant information. Even qualified users, such as academics (students, researchers and lecturers), spend time searching and filtering the information retrieved from the Web. Therefore, filtering information and managing its flow efficiently becomes a necessary and challenging task.

Information filtering systems are designed to extract the information that a user requests from an enormous amount of information that is not always of interest [3]. The term information source is used here to denote a site whose contents are of interest to the user. These sources are often related to the places where a document collection exists in text form [4].

On the one hand, technological advances make possible a network infrastructure that supports the most varied types of information resources (e.g., structured multimedia objects, documents and specialized databases). On the other hand, there is a need to develop client applications that assist the end user in the search, access, organization, and sharing of these information sources.

This paper describes a system that autonomously generates group profiles for web documents by selecting a suitable library of keywords, and a search agent that generates and optimizes, via a genetic algorithm (GA), search queries for the Google search engine.
The libraries of the group profiles take into account the relative frequency of a word in a given document and its relative frequency in a set of related and unrelated documents [5]; this approach is taken to insert context information into the system. The search agent uses a GA to optimize the search for new papers for a group of users instead of a single user. Both techniques were designed to be employed in an academic virtual community in the near future. This community will be characterized as a scientific paper collection (PDF files), automatically classified and stored in folder structures on a server, in which academics will be able to exchange experience and knowledge.

This article is structured as follows. Section 2 provides a brief overview of information filtering, representation of user profiles and web mining. Section 3 describes the method used for the construction of group profiles. Section 4 presents the genetic algorithm used for information filtering. Section 5 shows the performance evaluation of the algorithm, and the work is concluded in Section 6 with a discussion of future avenues for investigation.
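
As a rough illustration of the keyword-library idea introduced above, in which a word is weighted by its relative frequency in a document and by its relative frequency in related and unrelated documents, the following Python sketch may help fix intuition. It is not the method of Section 3 or of [5]: the scoring rule (document frequency multiplied by the difference between the mean frequencies in the related and unrelated sets), the function names, and the toy texts are all assumptions made only for this example.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return each word's share of all words in `text` (ASCII letters only)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {word: n / total for word, n in counts.items()} if total else {}


def mean_frequency(word, frequency_tables):
    """Average relative frequency of `word` over a list of frequency tables."""
    if not frequency_tables:
        return 0.0
    return sum(t.get(word, 0.0) for t in frequency_tables) / len(frequency_tables)


def keyword_scores(document, related, unrelated):
    """Assumed scoring rule (hypothetical, not taken from the paper):
    score(w) = rf_document(w) * (mean_rf_related(w) - mean_rf_unrelated(w)).
    Words that are equally common everywhere score close to zero."""
    doc_rf = relative_frequencies(document)
    related_rf = [relative_frequencies(d) for d in related]
    unrelated_rf = [relative_frequencies(d) for d in unrelated]
    return {
        word: rf * (mean_frequency(word, related_rf) - mean_frequency(word, unrelated_rf))
        for word, rf in doc_rf.items()
    }


def top_keywords(document, related, unrelated, k=3):
    """Return the k highest-scoring words, breaking ties alphabetically."""
    scores = keyword_scores(document, related, unrelated)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]


# Toy data, invented for illustration only.
DOCUMENT = "genetic algorithm search genetic"
RELATED = ["genetic algorithm optimization", "genetic search"]
UNRELATED = ["the cat sat", "search the web"]

if __name__ == "__main__":
    print(top_keywords(DOCUMENT, RELATED, UNRELATED))
```

On this toy data, "genetic" ranks first because it is frequent in the document and in the related texts but absent from the unrelated ones, whereas "search" is penalized for also appearing in an unrelated text.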