International Journal of Computer Applications (0975 – 8887) Volume 97– No.9, July 2014 43

Design of Web Ranking Module using Genetic Algorithm

Vikas Thada
Research Scholar
Dr.K.N.M. University
Newai, India

Vivek Jaglan, Ph.D
Asst.Prof(CSE),ASET
Amity University
Gurgaon, India

ABSTRACT
Crawling is the process by which web search engines collect data from the web. Focused crawling is a special type of crawling in which the crawler looks for information related to a predefined topic [1]. This paper presents a method for finding the most relevant document among a set of documents for a given set of keywords. Relevance is checked with the help of the Rogers-Tanimoto, Mountford and Baroni-Urbani/Buser similarity coefficients. The method uses a genetic algorithm to show that the average similarity of documents to the query increases when the probability of mutation is low and the probability of crossover is high. The method analyzes the performance of the different similarity coefficients on the same set of documents and ranks the documents whose relevancy is highest among the three coefficients.

General Terms
Growth, retrieval, crawl, engine

Keywords
Relevancy, similarity, coefficients, genetic

1. INTRODUCTION
The design of most focused crawlers is based on the vector space model. The model is used to judge the evenness of web pages and general web search algorithms. The relevance in turn works as a guide in following target links [2]. One of the most important modules of a search engine is the ranking module. The task of the ranking module is to assign a ranking score to relevant pages using some criterion. The output of the ranking module is a set of pages ordered by rank, i.e. pages with high rank are near the top of the list and pages with low rank are at the bottom. These pages are then presented to the user in ranking order.
A GA-based approach using the Rogers-Tanimoto, Mountford and Baroni-Urbani/Buser similarity coefficients is taken in this paper for ranking the retrieved documents.

2. GENETIC ALGORITHM
GAs are search algorithms that follow the concepts of natural selection and genetics [3]. They are powerful and efficient search and optimization techniques motivated by Darwin's theory of natural selection [4]. Genetic algorithms [5] are based on the principle of heredity and evolution, which claims that “in each generation the stronger individual survives and the weaker dies”. Therefore, each new generation should contain stronger (fitter) individuals than its ancestors.

A GA iterates over a population of candidate solutions of constant size. In each generation (iteration), the fitness of each chromosome in the current population is evaluated and a new population evolves. Chromosomes with higher fitness values go through a reproduction phase in which the selection, crossover and mutation operators are applied to produce the new population. Chromosomes with lower fitness values are discarded. The new population is then evaluated, and the selection, crossover and mutation operators are applied again. This process continues until an optimal solution for the given problem is obtained.

2.1 Fitness Evaluation
The fitness function assigns each candidate solution a value that indicates which of a number of solutions is the optimum. It can also be considered a measure of performance that shows how fit a candidate solution is. In an information retrieval system (IRS) that uses a GA, documents are retrieved on the basis of this fitness function. Several relevancy measures are available for finding the relevant documents; the similarity coefficients used here are listed in Table 1.

Table 1: Coefficients Used As Fitness Function In Research [6]

S.N   Coefficient Name        Similarity Formula
1.    Rogers & Tanimoto       (p+s) / (p+2*(q+r)+s)
2.    Baroni-Urbani/Buser     (p+sqrt(p*s)) / (p+q+r+sqrt(p*s))
3.    Mountford               2*p / (2*q*r+p*q+p*r)

To calculate the similarity metrics, we define the parameters p, q, r and s (with n = p+q+r+s) for two binary vectors x and y as the number of positions where:

p = (x=1 and y=1) (total match)
q = (x=1 and y=0) (single match)
r = (x=0 and y=1) (single match)
s = (x=0 and y=0) (no match)

This is shown in Table 2.

Table 2: Variables Used to Calculate Binary Similarities/Dissimilarities [7,6,8]

        y=1                         y=0
x=1     p = 1/1 (in both A and B)   q = 1/0 (only in A)
x=0     r = 0/1 (only in B)         s = 0/0 (in neither A nor B)

Here A and B may be any query or document represented in binary form.
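
As an illustration of Tables 1 and 2, the following is a minimal Python sketch that counts p, q, r and s for two binary vectors and evaluates the three coefficients. The function names, the example vectors and the choice to return 0.0 when a denominator is zero are assumptions made for this sketch; they are not taken from the paper.

```python
from math import sqrt


def contingency(x, y):
    """Count p, q, r, s (Table 2) for two equal-length binary vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(x, y))
    p = sum(1 for a, b in pairs if a == 1 and b == 1)  # term in both
    q = sum(1 for a, b in pairs if a == 1 and b == 0)  # term only in x
    r = sum(1 for a, b in pairs if a == 0 and b == 1)  # term only in y
    s = sum(1 for a, b in pairs if a == 0 and b == 0)  # term in neither
    return p, q, r, s


def _ratio(num, den):
    # Assumption of this sketch: an undefined ratio is reported as 0.0.
    return num / den if den else 0.0


def rogers_tanimoto(p, q, r, s):
    return _ratio(p + s, p + 2 * (q + r) + s)


def baroni_urbani_buser(p, q, r, s):
    root = sqrt(p * s)
    return _ratio(p + root, p + q + r + root)


def mountford(p, q, r, s):
    return _ratio(2 * p, 2 * q * r + p * q + p * r)


# Hypothetical query and document, encoded over the same six index terms.
query = [1, 1, 0, 1, 0, 0]
document = [1, 0, 0, 1, 1, 0]

counts = contingency(query, document)
scores = {
    "Rogers & Tanimoto": rogers_tanimoto(*counts),
    "Baroni-Urbani/Buser": baroni_urbani_buser(*counts),
    "Mountford": mountford(*counts),
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

For these example vectors p=2, q=1, r=1 and s=2, so the Rogers & Tanimoto value is 4/8 = 0.5, while the Baroni-Urbani/Buser and Mountford values both come to 4/6. Note that the Mountford denominator is zero whenever q = r = 0 (identical vectors), so a real fitness function would need an explicit policy for that case.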