WISDOM: Web Intra-page Informative Structure Mining based on Document Object Model

Hung-Yu Kao, Jan-Ming Ho*, and Ming-Syan Chen+

Department of Computer Science and Information Engineering
National Cheng Kung University
Tainan, Taiwan, ROC
E-Mail: hykao@mail.ncku.edu.tw

*Institute of Information Science
Academia Sinica
Taipei, Taiwan, ROC
E-Mail: hoho@iis.sinica.edu.tw

+Electrical Engineering Department
National Taiwan University
Taipei, Taiwan, ROC
E-Mail: mschen@cc.ee.ntu.edu.tw

ABSTRACT

To increase the commercial value and accessibility of their pages, most content sites publish pages that carry intra-site redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining the intra-page informative structure in news Web sites in order to find and eliminate redundant information. An intra-page informative structure is a subset of the original Web page and is composed of a set of fine-grained, informative blocks. The intra-page informative structures of pages in a news Web site contain only anchors linking to news pages or the bodies of news articles. We propose an intra-page informative structure mining system called WISDOM (Web Intra-page Informative Structure Mining based on the Document Object Model), which applies information theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small sub-trees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is then built by expanding this set using the proposed merging methods. Experiments on several real news Web sites show high precision and recall rates, which validate WISDOM’s practical applicability.

KEYWORDS

Intra-page informative structure, DOM, entropy, information extraction
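
As a rough illustration of how entropy can separate redundant blocks from informative ones, the sketch below scores a term by the normalized Shannon entropy of its occurrence distribution over the pages of a site. The abstract does not give WISDOM's actual formulation, so everything here is an assumption made for illustration: the function names `term_entropy` and `block_entropy`, the representation of pages as token lists, and the use of a simple average over a block's terms.

```python
import math


def term_entropy(term, pages):
    """Normalized Shannon entropy (0..1) of a term's distribution over pages.

    `pages` is a list of token lists (a hypothetical representation).
    A term spread evenly over every page scores near 1 (likely boilerplate);
    a term concentrated in a single page scores 0 (likely informative).
    """
    counts = [page.count(term) for page in pages]
    total = sum(counts)
    if total == 0 or len(pages) < 2:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    # Divide by the maximum possible entropy, log(number of pages).
    return entropy / math.log(len(pages))


def block_entropy(block_terms, pages):
    """Average term entropy of a block (assumed aggregation, not WISDOM's)."""
    if not block_terms:
        return 1.0  # Treat an empty block as uninformative.
    return sum(term_entropy(t, pages) for t in block_terms) / len(block_terms)


pages = [
    ["home", "news", "election"],
    ["home", "news", "storm"],
    ["home", "news", "budget"],
]
navigation_score = block_entropy(["home", "news"], pages)
article_score = block_entropy(["election"], pages)
```

Under these assumptions, the navigation-like block receives a higher score than the article-like block, so a threshold on the score could serve as one way to pick candidate informative blocks in a DOM sub-tree.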