ISSN (Print): 2320-9798    ISSN (Online): 2320-9801
International Journal of Innovative Research in Computer and Communication Engineering
Vol. 1, Issue 4, June 2013

Fuzzy Logic Approach to Combat Web Spam with TrustRank

Amit Prakash 1, Debjani Mustafi 2
M.E. Student, Dept. of CSE, B.I.T. Mesra, Jharkhand, India 1
Assistant Professor, Dept. of I.T., B.I.T. Mesra, Jharkhand, India 2

ABSTRACT: Web spam refers to techniques that manipulate the ranking algorithms of web search engines and cause them to rank certain pages higher than they deserve [1]. Spam web pages may pretend to provide assistance or facts about a particular subject, but the help is often meaningless and the information shallow. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. Today's search engines use variations of the fundamental ranking methods that offer some degree of spam resilience. PageRank is one such method: it not only counts the number of hyperlinks referring to a web page but also takes the PageRank of the referring pages into account; however, this concept has proven to be vulnerable to manipulation [12]. TrustRank overcomes the problems of PageRank but relies on human operators to judge a seed set and decide whether a page is spam or not. There are situations where an operator fails to assign a crisp value to a page; in such cases human sentiment becomes involved in deciding whether the page is spam. Our work examines the human sentiment involved in the judgment of the seed set. We also propose a model that minimizes the involvement of human sentiment by employing Fuzzy Logic in the seed selection process.

Keywords: PageRank, TrustRank, Fuzzy Logic, Spam, Search Engine, Web Graph.

I. INTRODUCTION

Spam is an arms race between search engines and spammers, since spammers are constantly coming up with more and more sophisticated techniques to beat search engines. It is a multi-million dollar industry that tries to fool search engines. Spammers exploit the way search engines work and deliberately manipulate search engine indexes [1, 4]. Everybody who publishes a site on the web wishes for it to be found, and not only that: it should appear in the top 10 results of the search engines. Why? Because people often click only on the results of the first page; even worse, it has been shown that most users look only at the topmost two links of the result page. Search engines are the entryways to the Web, which is why some people try to mislead them so that their pages rank high in search results and thus capture user attention [4]. There is a lot of money at stake: there is enormous value in getting to the top of the search results, and if spammers get their links to the top of the page, billions will see them. So how can a page be deliberately placed in the first few results? There are two ways: content spamming and link spamming [1, 11].

Content spamming techniques involve altering the logical view that a search engine has of a page's contents. This is done by engineering the page to be well understood by both humans and search engines, delivering extra meta-tags for the search engines, using different words, and feeding in as many keywords as possible, regardless of whether they match the topic of the page. Keyword stuffing, hidden or invisible text, meta-tag stuffing, doorway pages, and scraper sites are the most common content spamming techniques. They all target variants of the vector space model for information retrieval on text collections.
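To make this last point concrete, the short sketch below is a purely illustrative example (the scoring function, the query, and the page texts are our own inventions, not drawn from the cited works): a naive term-frequency scorer in the spirit of the vector space model ends up rewarding a keyword-stuffed page over an honest one.

    # Illustration only: a naive term-frequency scorer and how keyword stuffing exploits it.
    from collections import Counter

    def tf_score(query, document):
        """Score a document by summing the raw frequencies of the query terms."""
        tf = Counter(document.lower().split())
        return sum(tf[term] for term in query.lower().split())

    query = "cheap flights"
    honest_page = "we compare cheap flights from several airlines every day"
    stuffed_page = "cheap flights " * 50 + "buy our product"   # keyword stuffing

    print(tf_score(query, honest_page))    # 2 (modest, legitimate score)
    print(tf_score(query, stuffed_page))   # 100 (inflated; the spam page would rank first)

Production engines use more refined weightings (for example TF-IDF with length normalization), which dampen but do not fully remove this effect; hence the additional, link-based ranking signals discussed next.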
Link spam is defined as links between pages that are present for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it [2, 11]. Since people do not link to spam pages freely, spammers either trick many of them into pointing to a spam page or create the referencing pages themselves under different domains. Many search engines use the PageRank algorithm, which takes the number of incoming links into account, to rank their search results [5, 6]. PageRank assumes that each link is a valid vote for a web site, but some links are not really valid votes at all. In practice, the PageRank concept has proven to be vulnerable to manipulation, and extensive research has been devoted to identifying falsely inflated PageRank and to ways of ignoring links from documents with falsely inflated PageRank [12]. There are many blatant PageRank manipulation techniques. Another problem with PageRank is that it presents a bias against new web sites.

To overcome the problems of PageRank, the TrustRank idea was developed. The TrustRank algorithm is a procedure to rate the quality of websites. The basic idea is similar to the PageRank algorithm: the linking structure is used to generate a measure of the quality of a page. TrustRank is a semi-automatic process that involves human operators to judge a seed set [3]. Pages are divided into good (white) pages and bad (black) pages. It is also assumed that good pages seldom point to bad pages [3]; on the other hand, bad pages quite often link to bad pages. The ideal trust property is calculated using these assumptions, and the pages are eventually listed according to their probability of being trusted, with a score between 0 (bad) and 1 (good). Sometimes there may be a situation where we get a page in between
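As a rough illustration of how such a trust score can be propagated from a hand-judged seed set, the sketch below runs a biased PageRank iteration over a toy web graph. This is a simplified sketch in the style of TrustRank, not the exact procedure of [3]; the graph, the damping factor of 0.85, and the iteration count are assumptions made only for this example.

    # Simplified, illustrative trust propagation in the style of TrustRank.
    # The toy graph and all parameter values below are assumed for the example.
    import numpy as np

    def trustrank(adj, seeds, alpha=0.85, iters=50):
        """Propagate trust from a human-judged good seed set over a web graph.

        adj[i][j] = 1 if page i links to page j; seeds lists the good pages.
        Returns a trust score in [0, 1] for every page.
        """
        n = len(adj)
        out = adj.sum(axis=1)
        # Column-stochastic transition matrix; np.maximum avoids division by
        # zero for dangling pages (their columns remain all zero anyway).
        T = adj.T / np.maximum(out, 1)
        # Static distribution d: uniform over the trusted seeds, zero elsewhere.
        d = np.zeros(n)
        d[seeds] = 1.0 / len(seeds)
        t = d.copy()
        for _ in range(iters):
            t = alpha * (T @ t) + (1 - alpha) * d   # biased PageRank step
        return t

    # Toy chain of pages: the seed page 0 links to 1, which links to 2, which links to 3.
    adj = np.array([[0, 1, 0, 0],
                    [0, 0, 1, 0],
                    [0, 0, 0, 1],
                    [0, 0, 0, 0]], dtype=float)
    print(trustrank(adj, seeds=[0]))   # trust decays with distance from the seed

In the actual TrustRank procedure the seed pages are selected and judged by human operators, and it is precisely the subjectivity of that judgment that motivates the fuzzy logic based seed selection model proposed in this paper.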