International Journal of Computer Applications (0975 – 8887)
Volume 61 – No. 12, January 2013

Web Information Extraction: Tag Density and Keyword Approach

Shikha Shukla, Nitin, Sitendra Tamrakar
NRI Institute of Science and Technology

ABSTRACT
Web pages contain a great deal of noise in the form of advertisements, irrelevant information, copyright notices and menus. To extract information from the web we use two concepts: text density and the title of the page. Generally, the main content of a page is denser than the rest of the page, while noise carries less text. The title is the most important information on the page because it tells us what the page is about. We therefore extract all content that is denser than a particular threshold or that contains at least one of the keywords derived from the title of the page. Using this approach, more false negatives can be avoided. This approach gives very satisfactory results.

Keywords: Crawler, Web mining, Information extraction

1. INTRODUCTION
The data on the web is increasing exponentially due to the rapid growth of social networking and e-commerce websites, and users rely more and more on the web for daily activities such as online news, social networking, movies and shopping. As the number of web users increases, so does the noise, such as advertisements and bogus information. It has therefore become very important to filter out the noise and extract only the content that matters.

Several works have been done in this field. Shin, Kwangcheol, and Geun Sik Jo [1] used style sheets to extract information, but this approach requires user interaction: the user has to select some information of interest on the page, and the approach then searches for all information that follows the same style sheet. Sun, Fei, Dandan Song, and Lejian Liao [2] used a DOM tree based approach, extracting the data with maximum density as the main content.
Asfia, Mohsen, Mir Mohsen Pedram, and Amir Masoud Rahmani [3] used the VCE (Visual Clustering Extractor) algorithm, which takes a DOM tree as input and produces a smaller block tree together with some general parameters that are later used to determine the main block. Downey, Doug, et al. [4] extracted information by adding a pattern learning algorithm, which learns the most common patterns in which instances of a class appear. Yi, Lan, and Bing Liu [5] proposed a cleaning technique based on the layouts and contents of the web pages in a web site. Their method first finds a suitable data structure to capture and represent the common layouts or presentation styles in a set of pages of the web site; they use a compressed structure tree (CST) for this purpose. The nodes of the compressed structure tree are assigned an entropy measure, which is used to remove noise from the web page.

Our approach uses text density in combination with keywords to extract the information efficiently.

This paper is divided into three parts. The first is the introduction, given above. The second is the methodology, which covers the methods and concepts used to extract the information. The last is the result evaluation, which shows the performance of our approach.

2. METHODOLOGY

Figure 1: Flow Chart of Proposed Method (node labels: Start, Read URL, Read Html, Parse Html, Write Html, Write Input, Html To Text? with Yes/No branches, Output Main Content, End)
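
The selection rule stated in the abstract (keep content that is denser than a threshold or that contains at least one title keyword) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation evaluated in this paper: density is assumed here to be the number of text characters per tag within a block, blocks are assumed to start at the tags listed in BLOCK_TAGS, keywords are assumed to be title words of at least four characters, and the threshold value of 20 is arbitrary. All function and variable names are hypothetical.

```python
from html.parser import HTMLParser
import re

# Assumption of this sketch: these tags mark the start of a new block.
BLOCK_TAGS = {"div", "p", "td", "li", "h1", "h2", "h3", "section", "article"}
SKIP_TAGS = {"script", "style"}  # their text is never page content


class BlockCollector(HTMLParser):
    """Collects the page title and a flat list of [text, tag_count] blocks."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = []  # each entry: [accumulated text, number of tags]
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS or not self.blocks:
            self.blocks.append(["", 1])  # open a new block
        else:
            self.blocks[-1][1] += 1  # inline tag inside the current block

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and self.blocks:
            # Simplification: text is credited to the most recently opened block.
            self.blocks[-1][0] += data


def title_keywords(title, min_len=4):
    """Title words of at least min_len characters (assumed keyword rule)."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if len(w) >= min_len}


def extract_main_content(html, threshold=20.0):
    """Keep blocks that are dense enough OR that mention a title keyword."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    keywords = title_keywords(parser.title)
    kept = []
    for text, tag_count in parser.blocks:
        text = " ".join(text.split())  # normalise whitespace
        if not text:
            continue
        density = len(text) / tag_count  # assumed: characters per tag
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if density >= threshold or words & keywords:
            kept.append(text)
    return kept


page = """<html><head><title>Solar Panel Efficiency</title></head><body>
<div><a href="/">Home</a> <a href="/a">About</a> <a href="/c">Contact</a></div>
<p>Modern photovoltaic cells convert roughly a fifth of incoming sunlight into electricity.</p>
<p>Panel output: <b>high</b></p>
<div><a href="/ad">Buy now</a></div>
</body></html>"""

main_blocks = extract_main_content(page)
for block in main_blocks:
    print(block)
```

In this example the navigation and advertisement blocks are dropped because they are link-heavy and mention no title keyword. The long paragraph is kept by the density test, and the short "Panel output" paragraph is kept only because it contains the title keyword "panel", which is how the keyword test reduces false negatives.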