Extraction of RDF Dataset from Wikipedia Infobox Data

Jimmy K. Chiu, Thomas Y. Lee, Sau Dan Lee, Hailey H. Zhu, David W. Cheung
The University of Hong Kong, Hong Kong
{khchiu,ytlee,sdlee,hyzhu,dcheung}@cs.hku.hk

ABSTRACT

This paper outlines the process of cleansing infobox data from a Wikipedia data dump and extracting it into Resource Description Framework (RDF) triplets. The numbers of extracted triplets, resources, and predicates are large enough for many research purposes, such as semantic web search. Our software tool will be open-sourced so that researchers can produce up-to-date RDF datasets from routine Wikipedia data dumps.

1. INTRODUCTION

1.1 Motivation

Resource Description Framework (RDF) is recommended by the W3C as a solution for representing Internet resources on the semantic web [2]. Many research studies have been conducted on semantic web search and on efficient retrieval of RDF data. However, good RDF datasets for research purposes are scarce, because many existing RDF datasets suffer from some of the following problems. First, some datasets do not contain enough triplets, resources, or predicates for experiments with data mining techniques on very large databases. For example, the Jamendo RDF dataset [8] contains only around 1M triplets, and the Barton dataset [7] contains only 221 predicates. Second, some datasets require specific domain knowledge to comprehend, which makes human interpretation of experimental results difficult. For example, the Uniprot dataset [3] requires researchers to have life science knowledge. Third, many datasets are static and therefore cannot represent the continuous changes of web resources and their relationships.

Wikipedia is an online encyclopedia edited by the public. Although the content format of a Wikipedia page (wikipage) is essentially unstructured, many pages still carry a piece of structured information called an infobox, which can be transformed into meaningful RDF data.
While Wikipedia data are easy to understand by common sense, they are also regularly archived into database dumps, yielding an evolving dataset. Therefore, it is useful to transform Wikipedia infobox data into an RDF dataset for benchmarking and experimenting with database techniques. This paper describes the tool we have developed to extract infobox data from Wikipedia database dumps into RDF datasets.

1.2 Wikipedia Data

Wikipedia is regularly backed up into dump files [4]. One format of these dump files is XML, as shown in Fig. 1. The content of a single page is put inside a <page> tag.

  <page>
    <title>Tsing Ma Bridge</title>
    <id>91180</id>
    <revision>
      <id>160476218</id>
      <timestamp>2007-09-26T14:40:42Z</timestamp>
      <contributor>
        <username>Sameboat</username>
        <id>953247</id>
      </contributor>
      <text>...wikitext here...</text>
    </revision>
  </page>

Figure 1: XML fragment from the Wikipedia dump for the “Tsing Ma Bridge” wikipage

For a wikipage, the title is given by the <title> element and the source code (wikitext) is given by the <text> element.

Although wikipages are written largely in the loosely structured MediaWiki markup language [5], many wikipages have a section of structured data called an infobox. An infobox is a collection of name-value pairs, each consisting of an infobox field and a field value. An infobox field describes an attribute of the wikipage. Each infobox has a name, which is associated with one infobox template. The infobox template defines a suggested list of fields (attributes) that can be defined on the wikipages describing the same class of objects or concepts. For example, Fig. 2 shows a wikitext fragment that specifies the infobox data for the “Tsing Ma Bridge” wikipage. This infobox has the name Bridge and contains different fields, e.g., bridge name, carries, width, etc. These fields are defined in the bridge infobox template and shared by all infoboxes about bridges.
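The page structure of Fig. 1 can be read with any streaming XML parser. The following is a minimal sketch, not the paper's actual tool: it uses Python's xml.etree.ElementTree over a sample fragment modelled on Fig. 1, and the function name iter_pages and the SAMPLE constant are our own illustrative choices (real dumps also declare a MediaWiki export XML namespace, omitted here for simplicity).

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Sample fragment modelled on Fig. 1; namespaces omitted for simplicity.
SAMPLE = """<mediawiki>
  <page>
    <title>Tsing Ma Bridge</title>
    <id>91180</id>
    <revision>
      <id>160476218</id>
      <timestamp>2007-09-26T14:40:42Z</timestamp>
      <text>{{Infobox Bridge|bridge name=Tsing Ma Bridge}}</text>
    </revision>
  </page>
</mediawiki>"""

def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a dump, streaming
    so that a multi-gigabyte dump need not fit in memory."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            text = elem.findtext("revision/text") or ""
            yield title, text
            elem.clear()  # free the subtree we have already consumed

for title, text in iter_pages(StringIO(SAMPLE)):
    print(title)
```

Streaming with iterparse and clearing each consumed <page> subtree keeps memory usage roughly constant regardless of dump size, which matters because full Wikipedia dumps are far too large to load as a single DOM tree.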
Each field specifies an attribute of the object or concept described by the wikipage. For example, the bridge name is “Tsing Ma Bridge”. The value of a field may contain double-square-bracketed interwiki links to other pages. For example, the field locale contains two interwiki links [[Ma Wan Channel]] and [[Ma Wan|Ma Wan Island]], pointing to the wikipages titled “Ma Wan Channel” and “Ma Wan” respectively. Note that “Ma Wan Island” is the label of the latter link for display.

The aim of our software tool is to convert all infobox data in a Wikipedia dump into an RDF dataset. In each RDF triplet, the subject represents a wikipage, the predicate represents an infobox field, and the object represents a literal value or a wikipage. We plan to open-source our tool so that researchers can prepare RDF datasets from Wikipedia dumps.
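The wikipage/field/value-to-triplet mapping described above can be sketched as follows. This is a simplified illustration under our own assumptions, not the paper's cleansing pipeline: it assumes a single infobox with no nested templates, and the names parse_infobox, to_triples, and the WIKITEXT sample are hypothetical. Interwiki links of the form [[Target]] or [[Target|Label]] become resource objects keyed by the target title (the label is display-only, as noted above); all other values are kept as literals.

```python
import re

# A wikitext fragment in the style of Fig. 2; fields follow the
# paper's "Tsing Ma Bridge" example.
WIKITEXT = ("{{Infobox Bridge"
            "|bridge name=Tsing Ma Bridge"
            "|locale=[[Ma Wan Channel]], [[Ma Wan|Ma Wan Island]]"
            "}}")

# A field starts at "|name="; pipes inside [[Target|Label]] links are
# not followed by "=", so they do not start a new field here.
FIELD_START = re.compile(r"\|\s*([A-Za-z_][\w ]*?)\s*=")
# Capture the link target; drop the optional "|Label" display part.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def parse_infobox(wikitext):
    """Return (template name, {field: raw value}) for the first infobox.
    Simplified: assumes one infobox and no nested templates."""
    m = re.search(r"\{\{Infobox[ _]+(\w+)(.*?)\}\}", wikitext, re.S)
    if not m:
        return None, {}
    name, body = m.group(1), m.group(2)
    starts = list(FIELD_START.finditer(body))
    fields = {}
    for i, s in enumerate(starts):
        end = starts[i + 1].start() if i + 1 < len(starts) else len(body)
        fields[s.group(1)] = body[s.end():end].strip()
    return name, fields

def to_triples(page_title, fields):
    """Map fields to (subject, predicate, object) triplets: link targets
    become resource objects, other values stay literal."""
    triples = []
    for field, value in fields.items():
        targets = LINK.findall(value)
        if targets:
            triples.extend((page_title, field, t) for t in targets)
        else:
            triples.append((page_title, field, value))
    return triples

name, fields = parse_infobox(WIKITEXT)
print(name, to_triples("Tsing Ma Bridge", fields))
```

On the sample above, the locale field yields two triplets whose objects are the link targets "Ma Wan Channel" and "Ma Wan", while bridge name yields a literal-valued triplet; a production extractor would additionally need to handle nested templates, HTML entities, and other wikitext irregularities during cleansing.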