IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.1, January 2008

Manuscript received January 5, 2008
Manuscript revised January 20, 2008

Autonomous agent for gathering information to build focused index from distributed environment

Rushdi A. Hamamreh
Department of Computer Engineering, Al-Quds University

Abstract: This paper describes the architecture of an autonomous agent that gathers information from a distributed environment such as the Internet in order to build a subject-specific collection. To extract information from documents, the agent uses a latent semantic indexing algorithm and two filters, one for the collection and the other for queries.

Keywords: Internet autonomous agent, distributed systems, collections, latent semantic index, thematic filter.

1. Introduction

It is well known that search engines with a centralized architecture cannot index the whole Internet because of the exponential growth in the number of documents published there. A search engine with a distributed architecture is a scalable solution to this problem. Within this architecture we use a set of subject-specific collections of electronic documents published on the Internet. These collections belong to different owners, who are responsible for their content, indexing and quality of search. A user's query is automatically propagated to one or more collections whose topics are relevant to the query topic [1,2,4,10].

The conventional method of generating a subject-specific collection starts with the preparation of the collection core, a relatively small set of documents relevant to the collection topic. The administrator of the collection is responsible for preparing this core. After that, an information agent is used whose goal is to scan the Internet and seek documents relevant to the collection core. Usually a collection filter is used to select the documents relevant to the collection topic. This filter is based on the analysis of the collection core.
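A collection filter of this kind can be sketched as follows. This is only a minimal illustration and not the method used in this paper, which relies on probabilistic latent semantic indexing: here the core is summarized by a plain bag-of-words profile, and a candidate document is accepted when its cosine similarity to that profile reaches a threshold. The function names, the sample texts and the threshold value 0.2 are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_vector(text):
    """Return a bag-of-words term-frequency vector for one document."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_core_profile(core_documents):
    """Sum the term vectors of the core documents into one profile."""
    profile = Counter()
    for doc in core_documents:
        profile.update(term_vector(doc))
    return profile


def collection_filter(document, core_profile, threshold=0.2):
    """Accept a candidate document if it is similar enough to the core.

    The threshold is a hypothetical value and would have to be tuned.
    """
    return cosine(term_vector(document), core_profile) >= threshold


# Hypothetical two-document core prepared by the collection administrator.
core = [
    "latent semantic indexing of text documents",
    "indexing documents for search engines",
]
profile = build_core_profile(core)
on_topic = collection_filter("semantic indexing for search", profile)
off_topic = collection_filter("chocolate cake recipes", profile)
```

In this sketch the first candidate shares several terms with the core and passes the filter, while the second shares none and is rejected. A realistic filter would also remove stop words and weight terms, which is omitted here for brevity.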
In this paper we propose to use an additional filter based on the analysis of the archive of user queries previously received by the collection. These queries reflect the information needs of the whole community of users, and the information agent should take these needs into account. A new document is therefore recommended to the collection by our information agent if it is accepted either by the filter based on the analysis of the collection core or by the filter based on the analysis of the archive of user queries.

2. Architecture of the autonomous agent

Our agent contains the following main components:

· Analyzer of the collection content

The goal of this component is to analyze the whole set of documents in the collection and to create a collection description that reflects the main subjects present in it. For this purpose we use probabilistic latent semantic indexing [3,5]. The goal of latent semantic indexing is to extract latent factors that reflect the set of narrow topics present in the given collection. Let $z \in Z = \{z_1, \ldots, z_k\}$ be the set of these factors. We denote:

· $P(z_i)$: the probability that a randomly selected document from the collection corresponds best to the topic $z_i$ (see Eq. 2).
· $P(d \mid z)$: the probability that, for the given factor $z_i$, this factor corresponds best to the document $d_j$ (see Eq. 3).
· $P(w \mid z)$: the probability that, for the given factor $z_i$, this factor corresponds best to the word $w_j$ (see Eq. 4).

Here $D = \{d_1, \ldots, d_n\}$ is the set of all documents in the collection and $W = \{w_1, \ldots, w_m\}$ is the set of all words in the collection. The functions $P(z)$, $P(d \mid z)$ and $P(w \mid z)$ can be estimated by maximizing a likelihood function. This function has the following form: