Determining Web Pages Similarity Using Distributed Learning Automata and Graph Partitioning

Shahrzad Motamedi Mehr, Majid Taran
Department of Electrical and Computer Engineering, Islamic Azad University, Qazvin Branch, Qazvin, Iran
motamedi@tmu.ac.ir, m_taran@isc.iranet.net

Ali B. Hashemi, M. R. Meybodi
Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran
{a_hashemi, mmeybodi}@aut.ac.ir

Abstract— Determining the similarity between web pages is a key factor in the success of many web mining applications, such as recommendation systems and adaptive web sites. In this paper, we propose a new hybrid method that combines distributed learning automata and graph partitioning to determine the similarity between web pages from web usage data. The idea behind the proposed method is that if different users request a pair of pages together, then these pages are likely to correspond to the same information need and can therefore be considered similar. In the proposed method, a learning automaton is assigned to each web page and tries to find the similarities between that page and the other pages of the web site, utilizing the results of a graph partitioning algorithm performed on the graph of the web site. Computer experiments show that the proposed method outperforms the Hebbian algorithm and the only learning automata based method reported in the literature.

Keywords-web usage mining, distributed learning automata, web page similarity

I. INTRODUCTION

The World Wide Web has been growing rapidly in recent years, resulting in a huge volume of hyperlinked documents with no logical organization. Currently, Google indexes more than 3 billion web pages, and this number increases at a rate of 7.3 million pages per day. The massive influx of information onto the World Wide Web has facilitated not only information retrieval for users, but also knowledge discovery.
However, as users are provided with more information and service options, it has become more difficult for them to find the “right” or “interesting” information. This problem, commonly known as information overload, stems from the rapid growth in the amount of information on the web.

Most research in the data mining field is based on document content analysis (content mining) or on the graph structure and relations of documents (structure mining). In addition to the information obtained from these two methods, information about the behavior of users (gathered from log files on web servers or from applications on the user side) can be used to determine the relationships between documents [4], recommend pages [5][12], change the structure of a web site [11], provide personalized services [13][14][15], and optimize search engines [9]. Applications of this information are presented in detail in [10]. The authors of [22] model the problem as Q-learning while employing concepts and techniques commonly applied in the web usage mining domain. Their approach yields a system that is constantly learning, does not need periodic updates, and can easily adapt itself to changes in web site structure and content and to new trends in user behavior.

In [4], a new approach is presented for solving the web usage mining problem. It is based on the idea that if two documents respond to the same information need, then the two documents are similar. This method assumes that users, when choosing their next step, are aware of how the contents of the documents relate to one another. In effect, users create virtual connections between documents as they surf them. These relationships reflect the user's mental model and may not correspond to any physical links. The aim is a system that emulates human intuition, so that it can anticipate the desires of its users and provide them with the information they would find most interesting, even when these users cannot explicitly formulate what they are looking for.
This is particularly useful for multimedia documents, which do not contain any searchable keywords, and for queries that are as yet ill defined. In the basic scheme, when a user moves from document i to document j, only the connection between these two documents, a(i,j), is strengthened. Strengthening the connection between documents i and j is taken to correspond to an increase in the similarity of these two documents. In the proposed algorithm, when the user travels from document i to document j, not only is the connection between these two documents strengthened but, by the transitive relation, the connections from document i to the other documents that the user visits after j on the path are also strengthened, scaled by a reduction coefficient.

Using the ideas discussed in the previous paragraph, [1][2][3] present a method based on self-organizing distributed learning automata to determine the similarity of documents in a digital library. The method of [2] uses a distributed learning automata corresponding to the communication graph of the documents in a digital library. In this method, the only