Harvesting Knowledge from Web Data and Text
Tutorial Proposal for CIKM 2010 (1/2 Day)

Hady W. Lauw 1, Ralf Schenkel 2, Fabian Suchanek 3, Martin Theobald 4, and Gerhard Weikum 4
1 Institute for Infocomm Research, Singapore
2 Saarland University, Saarbrücken
3 INRIA Saclay, Paris
4 Max Planck Institute for Informatics, Saarbrücken

Keywords: information extraction, knowledge harvesting, machine reading, RDF knowledge bases, ranking

1 Overview and Motivation

The Web bears the potential of being the world's greatest encyclopedic source, but we are far from fully exploiting this potential. Valuable scientific and cultural content is interspersed with a huge amount of noisy, low-quality, unstructured text and media. The proliferation of knowledge-sharing communities like Wikipedia and the advances in automated information extraction from Web pages give rise to an unprecedented opportunity: can we systematically harvest facts from the Web and compile them into a comprehensive machine-readable knowledge base? Such a knowledge base would contain not only the world's entities, but also their semantic properties and their relationships with each other. Imagine a "Structured Wikipedia" that has the same scale and richness as Wikipedia itself, but offers a precise and concise representation of knowledge, e.g., in the RDF format. This would enable expressive and highly precise querying, e.g., in the SPARQL language (or appropriate extensions), with additional capabilities for informative ranking of query results. The benefits of solving the above challenge would be enormous.
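The RDF-style representation alluded to above can be illustrated with a minimal sketch: each fact is a (subject, predicate, object) triple, and simple lookups retrieve all objects related to an entity by a given relation. The entity and relation names below are illustrative toy data, not drawn from any actual knowledge base.

```python
# Toy knowledge base: facts as (subject, predicate, object) triples,
# in the spirit of RDF. All names here are illustrative assumptions.
facts = {
    ("Angela_Merkel", "type", "politician"),
    ("Angela_Merkel", "type", "scientist"),
    ("Nicolas_Sarkozy", "type", "politician"),
    ("Nicolas_Sarkozy", "marriedTo", "Carla_Bruni"),
    ("Carla_Bruni", "type", "singer"),
}

def objects(subject, predicate):
    """Return all objects o such that (subject, predicate, o) is a fact."""
    return {o for (s, p, o) in facts if s == subject and p == predicate}

print(sorted(objects("Angela_Merkel", "type")))  # ['politician', 'scientist']
```

A real system would store such triples in an RDF store and query them with SPARQL; this sketch only makes the data model concrete.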
Potential applications include:
1) a formalized machine-readable encyclopedia that can be queried with high precision like a semantic database;
2) a key asset for disambiguating entities by supporting fast and accurate mappings of textual phrases onto named entities in the knowledge base;
3) an enabler for entity-relationship-oriented semantic search on the Web, for detecting entities and relations in Web pages and reasoning about them in expressive (probabilistic) logics;
4) a backbone for natural-language question answering that would aid in dealing with entities and their relationships in answering who/where/when/etc. questions;
5) a key asset for machine translation (e.g., English to German) and interpretation of spoken dialogs, where world knowledge provides essential context for disambiguation;
6) a catalyst for the acquisition of further knowledge and largely automated maintenance and growth of the knowledge base.

While these application areas cover a broad, partly AI-flavored ground, the most notable one from a database perspective is semantic search: finally bringing DB methodology to Web search! For example, users (or tools on behalf of users) would be able to formulate queries about succulents that grow both in Africa and America, politicians who are also scientists or are married to singers, or flu medication that can be taken by people with high blood pressure. The search engine would return precise and concise answers: lists of entities or entity pairs (depending on the question structure), for example, Angela Merkel, Benjamin Franklin, etc., or Nicolas Sarkozy for the questions about scientists. This would be a quantum leap over today's search, where answers are embedded, if not buried, in lots of result pages that human users would have to read to extract entities and connect them to other entities. In this sense, the envisioned large-scale knowledge harvesting [42, 43] from Web sources may also be viewed as machine reading [13].
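The kind of entity-relationship query envisioned above ("politicians who are also scientists or are married to singers") can be sketched as a conjunctive query over toy triples. SPARQL would express this declaratively over an RDF store; the imperative Python below, with assumed illustrative data, only conveys the idea of returning precise entity answers rather than result pages.

```python
# Toy triples (illustrative assumptions, not real knowledge-base content).
facts = {
    ("Angela_Merkel", "type", "politician"),
    ("Angela_Merkel", "type", "scientist"),
    ("Nicolas_Sarkozy", "type", "politician"),
    ("Nicolas_Sarkozy", "marriedTo", "Carla_Bruni"),
    ("Carla_Bruni", "type", "singer"),
}

def has(s, p, o):
    """Check whether the fact (s, p, o) is in the knowledge base."""
    return (s, p, o) in facts

# All entities mentioned as a subject or object of some fact.
entities = {s for (s, _, _) in facts} | {o for (_, _, o) in facts}

# Politicians who are also scientists, or who are married to a singer.
answers = sorted(
    e for e in entities
    if has(e, "type", "politician")
    and (has(e, "type", "scientist")
         or any(has(e, "marriedTo", x) and has(x, "type", "singer")
                for x in entities))
)
print(answers)  # ['Angela_Merkel', 'Nicolas_Sarkozy']
```

The answer is a concise entity list, in contrast to keyword search, where a user would have to read through result pages to assemble the same information.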
2 Target Audience, Aims, and Organization of the Tutorial

The tutorial is aimed at a broad audience of researchers from the DB, IR, and KM communities, especially those interested in data and text mining, knowledge extraction, knowledge-based search, and uncertain data management. It aims to provide valuable knowledge about available data assets, as well as basic methods for knowledge-base construction and querying, to researchers working on knowledge discovery, on semantic search over Web and enterprise sources, or on coping with automatically extracted facts as a major use case for uncertain data management. In addition, it summarizes the state of the art and points out research opportunities to those who