Web News Extraction via Path Ratios

Gongqing Wu¹, Li Li¹, Xuegang Hu¹, Xindong Wu¹,²
¹ School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
² Department of Computer Science, University of Vermont, Burlington, VT 05405, U.S.A.
wugq@hfut.edu.cn, banli811@163.com, jsjxhuxg@hfut.edu.cn, xwu@uvm.edu

ABSTRACT

In addition to the news content, most web news pages also contain navigation panels, advertisements, related news links, and other non-news items. These items appear not only outside the news region, but also within the news content region itself. Effectively extracting the news content and filtering out this noise has an important effect on follow-up content management and analysis. Our extensive case studies indicate a latent relationship between web content layout and tag paths. Based on this observation, we design two tag path features to measure the importance of nodes: the Text to tag Path Ratio (TPR) and the Extended Text to tag Path Ratio (ETPR), and we describe the calculation of TPR by traversing the parse tree of a web news page. In this paper, we present Content Extraction via Path Ratios (CEPR), a fast, accurate, and general online method that effectively distinguishes news content from non-news content via the TPR/ETPR histogram. To improve the ability of CEPR to extract short texts, we propose a Gaussian smoothing method weighted by a tag path edit distance, which enhances the importance of internal-link nodes while ignoring noise nodes inside the news content. Experimental results on the CleanEval datasets and on web news pages randomly selected from well-known websites show that CEPR can extract news across multiple sources, styles, and languages. The average F-score and average score of CEPR are 8.69% and 14.25% higher than those of CETR, respectively, demonstrating better web news extraction performance than most existing methods.
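The TPR feature can be illustrated with a short sketch. The paper gives the precise definition; the version below is a hypothetical simplification that treats the TPR of a tag path as the total length of text appearing under that path divided by the number of occurrences of that path in the parse tree. All names, the sample page, and the formula here are illustrative assumptions, not the authors' implementation:

```python
from html.parser import HTMLParser
from collections import defaultdict

class PathStats(HTMLParser):
    """Collect, per tag path (root-to-node tag sequence), the total text
    length and the number of nodes sharing that path."""
    def __init__(self):
        super().__init__()
        self.stack = []                  # open tags from root to current node
        self.chars = defaultdict(int)    # tag path -> total text length
        self.count = defaultdict(int)    # tag path -> node occurrences

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.count["/".join(self.stack)] += 1

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.chars["/".join(self.stack)] += len(text)

def tpr(html):
    """Assumed TPR: text characters under a path / occurrences of that path."""
    p = PathStats()
    p.feed(html)
    return {path: p.chars[path] / p.count[path] for path in p.count}

page = ("<html><body><div><p>Long news sentence here.</p></div>"
        "<div><a>ad</a></div></body></html>")
ratios = tpr(page)
# The content-bearing path html/body/div/p gets a high ratio (24.0),
# while the link path html/body/div/a scores low (2.0).
```

Content-bearing paths yield large ratios while boilerplate and link paths score near zero, which is the kind of separation the TPR/ETPR histogram exploits.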
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information filtering; H.3.1 [Content Analysis and Indexing]: Abstracting methods

General Terms
Algorithms, Experimentation

Keywords
Content extraction, web news, text to tag path ratio, weighted Gaussian smoothing

1. INTRODUCTION

The Web has become a platform for content production and consumption. According to Pew Internet & American Life tracking surveys, reading news is one of the most popular activities of Internet users. An investigation in November 2005 showed that over 46% of Internet users read web news on a typical day¹, and traditional newspapers have also developed corresponding web news sites to follow this trend. As reading web news is the fastest way to acquire rich information, web news has a huge user community. Table 1 shows the top 5 most popular news websites and their estimated numbers of unique monthly visitors as of May 1, 2013².

Table 1. The top 5 most popular news websites and their estimated unique monthly visitors

    Popular News Website    Unique Monthly Visitors
    Yahoo! News             110 million
    CNN                     74 million
    MSNBC                   73 million
    Google News             65 million
    New York Times          59.5 million

Besides news content, a typical web news page contains title banners, advertisements, related links, copyrights, and disclaimer notices. These additional non-news items, also known as noise, account for roughly 40-50% of the content on news websites [1]. Many web applications therefore need to clean up news web pages. For instance, with clean news content as input, the quality of web news summaries improves [2], and small-screen devices such as mobile phones and PDAs deliver a better user experience. In addition, noise consumes extra storage space and computing time in web information retrieval, content management, and analysis, and degrades the quality of service.
Our goal is to extract news content and filter non-news noise from web pages. This goal may seem simple, but extracting news content from millions of websites without consistent news publication standards is a non-trivial problem. Massive and heterogeneous web news poses a data management challenge for handcrafted or rule-based learning techniques. Such techniques are suitable for building wrappers for specific websites, but they usually fail when extracting news from massive and heterogeneous web pages in an open environment. In addition, most vision-based or template-based wrappers have failed to keep up with the frequent changes of web pages.

¹ http://www.pewinternet.org/~/media/Files/Reports/2005/PIP_SearchData_1105.pdf.pdf
² http://www.ebizmba.com/articles/news-websites, May 1, 2013.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
CIKM'13, October 27 - November 01, 2013, San Francisco, CA, USA.
Copyright 2013 ACM 978-1-4503-2263-8/13/10…$15.00.
http://dx.doi.org/10.1145/2505515.2505558