Automated Metadata and Instance Extraction from News Web Sites

Srinivas Vadrevu, Saravanakumar Nagarajan, Fatih Gelgi, Hasan Davulcu
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ, 85287, USA
{svadrevu, nrsaravana, hdavulcu, fagelgi}@asu.edu

Abstract

In this paper, we present automated techniques for extracting metadata and instance information by organizing and mining a set of news Web sites. We develop algorithms that detect and utilize HTML regularities in Web documents to turn them into hierarchical semantic structures encoded as XML. We present tree-mining algorithms that identify key domain concepts and their taxonomical relationships. We also extract semi-structured concept instances, annotated with their labels whenever these are available. We report an experimental evaluation on the news domain to demonstrate the efficacy of our algorithms.

1 Introduction

Extracting, managing and organizing data from unstructured and semi-structured Web pages is an important problem that several researchers have investigated [1, 5, 8]. Critical information such as metadata and attribute labels is usually unlabeled and difficult to locate. It is also presented in various incompatible formats in different Web sources. This data must be digested and organized in a uniform manner so that it can be used for scalable ad-hoc querying, automatic summarization, integration and mediation over the Web.

There is a plethora of techniques that explore information extraction from semi-structured and unstructured Web sources. For example, wrapper induction [10] and semi-automated wrapper learning [2] methods work by learning the path expressions to the data. These approaches require human intervention, either to provide labeled examples or to maintain the wrapper manually.
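
To make the idea of a path-expression wrapper concrete, the following is a minimal sketch and is not taken from any of the systems cited above. The two page snippets, the `HEADLINE_PATH` expression and the function name `extract_headlines` are all hypothetical, and the pages are assumed to be well-formed XHTML so that Python's standard XML parser can read them. The sketch shows a hand-written path expression that extracts headlines from one page layout and silently returns nothing once the layout changes, which is the kind of maintenance burden such wrappers carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical news page, assumed to be well-formed XHTML.
PAGE_V1 = """<html><body>
<div class="story"><h2>Council approves budget</h2><span class="date">2004-05-01</span></div>
<div class="story"><h2>Local team wins final</h2><span class="date">2004-05-02</span></div>
</body></html>"""

# The same content after a hypothetical redesign of the site template.
PAGE_V2 = """<html><body>
<table><tr><td class="story"><b>Council approves budget</b></td><td>2004-05-01</td></tr></table>
</body></html>"""

# Hand-written path expression (an assumption for illustration):
# every <h2> directly inside a <div class="story"> under <body>.
HEADLINE_PATH = "./body/div[@class='story']/h2"


def extract_headlines(page, path=HEADLINE_PATH):
    """Return the text of every node matched by the path expression."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(path)]


if __name__ == "__main__":
    print(extract_headlines(PAGE_V1))  # matches the layout the path was written for
    print(extract_headlines(PAGE_V2))  # the path no longer matches anything
```

A wrapper of this kind has to be rewritten, or relearned from newly labeled examples, each time the page layout changes.
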
On the other hand, schema learning [14] and automatic data extraction [5, 1, 11] methods work on structured Web sites to extract the schema and reconstruct the template of the Web pages. These approaches place rigid requirements on the input Web pages: they must be template-driven and must structure their content in a uniform manner. For example, RoadRunner [5] works with a pair of documents from a collection of template-generated Web pages to infer a grammar for the collection using union-free regular expressions. Another class of algorithms [16, 3, 6] requires that an ontology of concepts, relationships and their value types be provided a priori in order to find matching information.

In order to develop efficient techniques that extract metadata and instance information from Web pages in an automated manner, it is usually helpful to exploit specific characteristics of the domain of interest. One such domain is that of on-line newspapers and news portals on the Web, which have become one of the most important sources of up-to-date information. There are thousands of sites that provide daily news in very distinct formats, and there is a growing need for tools that allow individuals to access and keep track of this information automatically.

In this paper, we present techniques for automatically extracting metadata and instance information by organizing and mining a set of news Web sites. We extract a common news taxonomy that organizes the important concepts, together with the individual news articles for these concepts and their attribute information.

Our system, OntoMiner, differs from earlier information extraction methods in that it works in a completely automated manner without any human intervention, does not require any labeled training examples, and does not assume anything about the presentation template of the input Web pages.
The main contributions of the OntoMiner system are threefold:

• A semantic partitioning algorithm that logically segments an HTML Web page and groups and organizes its content.
• A taxonomy mining algorithm that organizes important concepts in a set of overlapping Web sites.
• An instance mining algorithm that extracts individual instances with their attribute labels from Web pages that belong to the same category.

The OntoMiner system is initialized with a collection of news Web sites. It proceeds by detecting and utilizing the HTML