IJIRST – International Journal for Innovative Research in Science & Technology | Volume 1 | Issue 7 | December 2014
ISSN (online): 2349-6010
All rights reserved by www.ijirst.org

A Survey on Semantic Focused Web Crawler for Information Discovery Using Data Mining Technique

Ruchika Patel
Department of Computer Engineering
Ipcowala Institute of Engineering & Technology, Dharmaj, Anand, Gujarat, India - 388430

Pooja Bhatt
Department of Computer Engineering
Ipcowala Institute of Engineering & Technology, Dharmaj, Anand, Gujarat, India - 388430

Abstract

Data mining is the process of extracting hidden predictive information from large databases. It is a new technology with great potential to help companies focus on the most important information in their data warehouses. Web mining is a data mining technique that automatically discovers information from web documents. The volume of data on the World Wide Web (WWW) and its dynamic nature make it impossible to crawl the Web completely. The challenge for crawlers is to fetch only the relevant pages from this information explosion. A focused crawler addresses this relevancy problem by concentrating on web pages for a given topic or set of topics. Finding meaningful information among the billions of information resources on the World Wide Web is a difficult task, given the growing popularity of the Internet. This paper studies the various data mining techniques for finding relevant information on the World Wide Web using a web crawler.

Keywords: Web Mining, Web Crawler, Focused Crawler, World Wide Web (WWW).
_______________________________________________________________________________________________________

I. INTRODUCTION

The Internet has become the largest unstructured database for accessing information across documents.
[8] It is well recognized that information technology has a profound effect on the conduct of business, and the Internet has become the largest marketplace in the world. Innovative business professionals have realized the commercial applications of the Internet for their customers and strategic partners. [2]

With the rapid growth of electronic text on the WWW, more and more of the knowledge that people need is available online. However, the massive amount of text also makes it difficult for people to find useful information. For example, standard Web search engines have low precision: they typically return a few relevant Web pages mixed with a large number of irrelevant pages, mainly because topic-specific features may occur in different contexts. An appropriate way of organizing this overwhelming number of documents is therefore necessary. [1]

The World Wide Web is an architectural framework for accessing linked documents spread out over millions of machines all over the Internet.

A. Overview of Web Mining

Web mining refers to the discovery of knowledge from Web data, which includes Web pages, media objects on the Web, Web links, Web log data, and other data generated by the usage of Web data. Web mining is classified into: (a) Web content mining, (b) Web structure mining and (c) Web usage mining. [8]

Web content mining refers to mining knowledge from Web pages and other Web objects. Web structure mining refers to mining knowledge about the link structure connecting Web pages and other Web objects. Web usage mining refers to the mining of usage patterns of Web pages found among users accessing a Website. Among the three, Web content mining is perhaps the most extensively studied, owing to prior work in text mining. The traditional topics covered by Web content mining include:

1) Web page classification
This involves the classification of Web pages under pre-defined categories that may be organized in a tree or other structures.
[8]

2) Web clustering
This involves grouping Web pages based on the similarities among them. Each resulting group should contain similar Web pages, while Web pages from different groups should be dissimilar. [8]
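
The two Web content mining tasks described above can be illustrated with a small sketch. It is not a method taken from the surveyed literature. It assumes a bag-of-words representation, cosine similarity as the similarity measure, nearest-profile assignment for classification, and a simple single-pass threshold scheme for clustering. The category profiles, page texts, stopword list, and the threshold of 0.3 are all hypothetical values chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "and", "as", "on", "with", "for", "its", "of", "in", "to", "each"}


def term_vector(text):
    """Lowercase the text and count the non-stopword terms (bag of words)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(page_text, category_profiles):
    """Assign a page to the pre-defined category with the most similar profile."""
    vec = term_vector(page_text)
    return max(category_profiles, key=lambda name: cosine(vec, category_profiles[name]))


def cluster(pages, threshold=0.3):
    """Single-pass clustering: a page joins the first cluster whose seed page
    is at least `threshold` similar; otherwise it starts a new cluster."""
    vectors = [term_vector(p) for p in pages]
    clusters = []  # lists of page indices; the first index in each list is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical category profiles and page texts, invented for this example.
CATEGORIES = {
    "sports": term_vector("football match team player score league goal"),
    "finance": term_vector("stock market bank investment price shares trading"),
}

PAGES = [
    "The team won the football match with a late goal",
    "Stock prices fell as the bank cut its investment forecast",
    "The league announced the football match schedule for each team",
    "The bank said stock investment in the market rose",
]

if __name__ == "__main__":
    for page in PAGES:
        print(classify(page, CATEGORIES), "<-", page)
    print(cluster(PAGES))
```

On this sample, the first and third pages are assigned to the sports category and the second and fourth to finance, and clustering without any category labels produces the same two groups. A flat dictionary of categories is used here for brevity; the tree-structured categories mentioned above would require a hierarchical lookup instead.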