An Improved Approach to Perform Crawling and Avoid Duplicate Web Pages

Dhiraj Khurana 1, Satish Kumar 2

1 Assistant Professor, CSE Department
University Institute of Engineering & Technology
Maharshi Dayanand University, Rohtak (Haryana)
dhirajkhurana23@rediffmail.com

2 Assistant Professor, CSE Department
Vaish College of Engineering, Rohtak (Haryana)
Krsk23@gmail.com

Abstract
A web search often returns many duplicate web pages or websites; that is, a number of similar pages may be retrieved from different web servers. We propose a web crawling approach to detect and avoid duplicate or near-duplicate web pages. In this work we present a keyword-prioritization-based approach to identify such pages on the web. Identifying these pages optimizes the web search.

Keywords: Crawler, Optimization, Duplicate, Webpage, Prioritization

1. Introduction
Besides piracy, one of the main problems on the Internet today is redundant information, which exists because replicated pages are archived at different locations such as mirror sites. As a result, the burden falls on web users to sort through retrieved pages to identify non-redundant data, which is a tedious and tiring process. Since the amount of information available on the Internet increases daily, filtering redundant and similar documents becomes an ever more difficult task for the user. Owing to the rapid growth of electronic documents, redundant information keeps increasing on the Web. Many technologies have emerged to make use of the information available on the Web; information retrieval systems are one of them.

1.1 RESEARCH PROBLEM
A web crawler is the basic tool for searching and downloading data efficiently from the web. Most search engines and downloaders use such a tool to detect and fetch pages. Today, however, user requirements are often very specific; for example, a researcher may want to search only for data on a particular topic. For this, a topic-based web crawler is used.

1.2 PROPOSED GOALS
The proposed work concerns a topic-based incremental crawler. When a topic-based search is performed, pages on merely similar topics may also be found. The proposed work aims to exclude such pages from the list of downloadable pages. For this duplicate-page analysis we propose a suffix-tree-based approach that performs keyword-based matching in optimal time; a sketch of this matching step is given at the end of Section 2. By excluding such pages from a topic-based search, the crawling process is optimized.

2. RESEARCH METHODOLOGY
We approach the detection of duplicate web pages in web crawling with a constructive, analytical and exploratory research design: a constructive design to get the objectives clearly defined, and an analytical design to use facts and information already available and analyze them to make a critical evaluation for the research.
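To make the duplicate-page analysis of Section 1.2 concrete, the following is a minimal sketch in Python. It is not the authors' implementation: a naive suffix trie (quadratic construction) stands in for a true linear-time suffix tree, and the keyword list, Jaccard overlap measure and 0.8 threshold are illustrative assumptions.

    # Minimal sketch of suffix-tree-style keyword matching for
    # duplicate-page analysis. Assumptions: a naive suffix trie replaces
    # a linear-time suffix tree; the keywords and threshold are illustrative.

    class SuffixTrie:
        """Naive suffix trie: stores every suffix of the indexed text."""

        def __init__(self, text: str):
            self.root: dict = {}
            text = text.lower()
            for i in range(len(text)):          # insert every suffix text[i:]
                node = self.root
                for ch in text[i:]:
                    node = node.setdefault(ch, {})

        def contains(self, keyword: str) -> bool:
            """A keyword is a substring iff it traces a path from the root."""
            node = self.root
            for ch in keyword.lower():
                if ch not in node:
                    return False
                node = node[ch]
            return True


    def keyword_profile(page_text: str, keywords: list[str]) -> set[str]:
        """Return the subset of prioritized keywords present in the page."""
        trie = SuffixTrie(page_text)
        return {kw for kw in keywords if trie.contains(kw)}


    def is_near_duplicate(page_a: str, page_b: str, keywords: list[str],
                          threshold: float = 0.8) -> bool:
        """Flag two pages as near-duplicates when their keyword profiles
        overlap beyond the (illustrative) Jaccard threshold."""
        prof_a = keyword_profile(page_a, keywords)
        prof_b = keyword_profile(page_b, keywords)
        if not prof_a and not prof_b:
            return True                         # neither page matches the topic
        overlap = len(prof_a & prof_b) / len(prof_a | prof_b)
        return overlap >= threshold


    if __name__ == "__main__":
        topic_keywords = ["crawler", "duplicate", "suffix tree", "ranking"]
        page1 = "An incremental crawler that avoids duplicate pages via a suffix tree."
        page2 = "Avoiding duplicate pages in an incremental crawler using a suffix tree."
        print(is_near_duplicate(page1, page2, topic_keywords))  # True

A production crawler would replace the naive trie with a linear-time construction such as Ukkonen's algorithm so that the "optimum time" matching claimed in Section 1.2 actually holds for large pages.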
3. RELATED WORK
Akansha Singh presented a work, "Faster and Efficient Web Crawling with Parallel Migrating Web Crawler". The paper aims at designing and implementing a parallel migrating crawler in which the work of a single crawler is divided among a number of independent, parallel crawlers that migrate to different machines to improve network efficiency and speed up downloading. The migration and parallel working of the proposed
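The division of crawling work among independent parallel workers described above can be sketched as follows. This is an illustrative outline, not the cited system: the seed URLs, worker count and timeouts are hypothetical, and true migration across machines would need a distributed queue rather than the in-process one used here.

    # Illustrative sketch of dividing crawling work among parallel
    # workers, in the spirit of the parallel migrating crawler above.
    # Seeds, worker count and timeouts are hypothetical.

    import queue
    import threading
    import urllib.request
    from urllib.parse import urlparse

    frontier: "queue.Queue[str]" = queue.Queue()   # shared URL frontier
    seen: set[str] = set()                         # URLs already claimed

    def worker(worker_id: int) -> None:
        """Each worker independently pulls URLs and downloads pages."""
        while True:
            try:
                url = frontier.get(timeout=2)      # stop when frontier drains
            except queue.Empty:
                return
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    page = resp.read()
                print(f"worker {worker_id}: fetched {len(page)} bytes "
                      f"from {urlparse(url).netloc}")
            except OSError as exc:
                print(f"worker {worker_id}: failed {url}: {exc}")
            finally:
                frontier.task_done()

    if __name__ == "__main__":
        # Links extracted from fetched pages would be filtered against
        # `seen` the same way before being added to the frontier.
        for url in ["https://example.com/", "https://example.org/"]:
            if url not in seen:
                seen.add(url)
                frontier.put(url)
        threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()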