Catriple: Extracting Triples from Wikipedia Categories

Qiaoling Liu¹, Kaifeng Xu¹, Lei Zhang², Haofen Wang¹, Yong Yu¹, and Yue Pan²

¹ Apex Data and Knowledge Management Lab, Shanghai Jiao Tong University, Shanghai, 200240, China
{lql,kaifengxu,whfcarter,yyu}@apex.sjtu.edu.cn
² IBM China Research Lab, Beijing, 100094, China
{lzhangl,panyue}@cn.ibm.com

Abstract. As an important step towards bootstrapping the Semantic Web, many efforts have been made to extract triples from Wikipedia because of its wide coverage, good organization and rich knowledge. One important kind of triple concerns Wikipedia articles and their non-isa properties, e.g. (Beijing, country, China). Previous work has tried to extract such triples from Wikipedia infoboxes, article text and categories. The infobox-based and text-based extraction methods depend on the infoboxes and suffer from low article coverage. In contrast, the category-based extraction methods exploit the widespread categories. However, they rely on predefined properties, which requires too much manual effort and explores only very limited knowledge in the categories. This paper automatically extracts properties and triples from the less explored Wikipedia categories so as to achieve wider article coverage with less manual effort. We realize this goal by utilizing the syntax and semantics brought by super-sub category pairs in Wikipedia. Our prototype implementation outputs about 10M triples with a 12-level confidence ranging from 47.0% to 96.4%, which cover 78.2% of Wikipedia articles. Among them, 1.27M triples have a confidence of 96.4%. Applications can use the triples with suitable confidence on demand.

1 Introduction

Extracting as much semantic data as possible from the Web is an important step towards bootstrapping the Semantic Web. Many efforts have been made to extract triples from Wikipedia because of its wide coverage of domains and good organization of contents.
More importantly, Wikipedia embraces the power of collaborative editing to harness collective intelligence, which results in rich knowledge in its articles, categories and infoboxes. Table 1 shows the volume of the rich knowledge contained in English Wikipedia¹. One important kind of triple concerns Wikipedia articles and their non-isa properties, e.g. (Beijing,

¹ The data used in this paper is from the English Wikipedia database dump of 2008-1-3.

J. Domingue and C. Anutariya (Eds.): ASWC 2008, LNCS 5367, pp. 330–344, 2008. © Springer-Verlag Berlin Heidelberg 2008