Ranjna Gupta et al. / (IJCSE) International Journal on Computer Science and Engineering Vol. 02, No. 04, 2010, 1395-1400

Query Based Duplicate Data Detection on WWW

Ranjna Gupta #1, Neelam Duhan *2, A.K. Sharma *3, Neha Aggarwal #4
# Computer Science & Engineering, B.S.A. Institute of Technology & Management, Faridabad, India
Ranjna.gupta@gmail.com, Aggarwalneha2k@gmail.com
* Department of Computer Engineering, Y.M.C.A. University of Science & Technology, Faridabad, India
Neelam.duhan@gmail.com

Abstract— The problem of finding relevant documents has become much more prominent due to the presence of duplicate data on the WWW. This redundancy increases the time users must spend sifting through search results for the desired information, when most users simply want to scan a few result pages to find new or different results. The identification of similar or near-duplicate pairs in a large collection is a significant problem with widespread applications. A contemporary manifestation of this problem is the efficient identification of near-duplicate web pages, which is particularly challenging at web scale because of the sheer volume of data. A mechanism is therefore needed for detecting duplicate data so that relevant search results can be presented to the user. In this paper, an architecture is proposed that introduces both online and offline methods, based on favored and disfavored user queries, to detect duplicates and near-duplicates.

Keywords—WWW; Query log; Cluster; Search Engine; Ranking Algorithm

I. INTRODUCTION

The growth of the Internet has flooded search results with numerous copies of web documents, reducing the usefulness of those results and creating a serious problem for Internet search engines. The perpetual growth of the Web and e-commerce has also led to an increased demand for new Web sites and Web applications.
The tremendous volume of web documents poses challenges to the performance and scalability of web search engines. Duplication is an inherent problem that search engines have to deal with: one study of 238,000 hosts reported that about 10% of hosts are mirrored to various extents [1]. Consequently, many identical or near-identical results would appear in the search results if search engines did not address this problem effectively. Such duplicates significantly decrease the perceived relevance of search engines, so automatic duplicate document detection is a crucial problem for them. "Duplicate documents" refers not only to completely identical documents but also to nearly identical ones. In this paper, an approach is proposed that detects duplicates and near-duplicates using both offline and online techniques so that relevant search results can be displayed to users.

The paper is organized as follows: Section II describes the current research in this area; Section III illustrates the proposed work to detect duplicate web pages based on query clusters formed from query logs; Section IV evaluates the performance of the proposed work; and the last section concludes.

II. RELATED WORK

The notion of duplicate data detection has been a subject of interest for many years, and a number of researchers have discussed the problem of obtaining relevant results from search engines. A technique for estimating the degree of similarity among pairs of documents, known as shingling, was presented in 1997 [2]. It does not rely on any linguistic knowledge other than the ability to tokenize documents into a list of words, i.e., it is purely syntactic. All sequences of adjacent words (shingles) are extracted; if two documents contain the same set of shingles they are considered equivalent, and if their sets of shingles appreciably overlap, they are exceedingly similar.
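The shingling idea above can be illustrated with a minimal sketch: extract all w-word shingles from each document and compare the shingle sets by their Jaccard overlap. The window size of four words and the regex-based tokenizer are illustrative assumptions, not parameters fixed by the original technique.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles (sequences of adjacent words)."""
    words = re.findall(r"\w+", text.lower())  # purely syntactic tokenization
    return {tuple(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}


def resemblance(a, b, w=4):
    """Jaccard overlap of the two shingle sets.

    1.0 means the shingle sets are identical (documents considered
    equivalent); values near 1.0 indicate near-duplicates.
    """
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over a lazy dog"
print(resemblance(doc1, doc1))  # identical shingle sets -> 1.0
print(resemblance(doc1, doc2))  # appreciable overlap -> high but < 1.0
```

In practice, systems avoid comparing full shingle sets pairwise and instead hash or sample shingles, but the resemblance measure is the same.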
A new approach that performs copy detection on web documents [3] determines similar web documents and similar sentences, and graphically captures the similar sentences in any two web documents. Besides handling a wide range of documents, this copy-detection approach is applicable to web documents in different subject areas, as it does not require static word lists. A novel algorithm, DustBuster, for uncovering DUST (Different URLs with Similar Text) [4] has also been proposed in the literature. This method is intended to discover rules that transform a given URL into others that are likely to have similar content. DustBuster employs previous crawl logs or web server logs instead of probing the page contents.

ISSN : 0975-3397 1395
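DustBuster's mining of rules from logs is beyond a short sketch, but the kind of substring-substitution rule it discovers, and how such rules collapse DUST variants to one canonical URL, can be illustrated as follows. The two rules shown are hypothetical examples of the sort a log-mining step might produce, not output of the actual algorithm.

```python
def apply_dust_rule(url, alpha, beta):
    """Apply a DUST rule alpha -> beta: substitute alpha with beta once.
    When the rule fires, the rewritten URL is likely to serve similar content."""
    return url.replace(alpha, beta, 1) if alpha in url else url


# Hypothetical rules of the kind that could be mined from crawl/server logs:
rules = [("www.", ""), ("/index.html", "/")]


def canonicalize(url, rules):
    """Apply each rule once so that DUST variants collapse to the same
    canonical URL, letting a crawler skip the redundant fetches."""
    for alpha, beta in rules:
        url = apply_dust_rule(url, alpha, beta)
    return url


print(canonicalize("http://www.example.com/index.html", rules))
# -> http://example.com/
```

Grouping URLs by their canonical form then identifies likely duplicates without downloading any page content, which is the key efficiency gain the approach aims for.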