Measuring the Quality of Web Content using Factual Information

Elisabeth Lex, Know-Center GmbH, elex@know-center.at
Michael Voelske, Bauhaus-Universität Weimar, michael.voelske@uni-weimar.de
Marcelo Errecalde, Edgardo Ferretti, Leticia Cagnina, Universidad Nacional de San Luis, {merreca|ferretti|lcagnina}@unsl.edu.ar
Christopher Horn, Graz University of Technology, christopher.horn@tugraz.at
Benno Stein, Bauhaus-Universität Weimar, benno.stein@uni-weimar.de
Michael Granitzer, University of Passau, Michael.Granitzer@uni-passau.de

ABSTRACT
Nowadays, many decisions are based on information found on the Web. For the most part, the disseminating sources are not certified, and hence assessing the quality and credibility of Web content has become more important than ever. With factual density we present a simple statistical quality measure that is based on facts extracted from Web content using Open Information Extraction. In a first case study, we use this measure to identify featured/good articles in Wikipedia. We compare the factual density measure with word count, a measure that has successfully been applied to this task in the past. Our evaluation corroborates the good performance of word count in Wikipedia, since featured/good articles are often longer than non-featured ones. However, for articles of similar length the word count measure fails, while factual density can separate them with an F-measure of 90.4%. We also investigate the use of relational features for categorizing Wikipedia articles into featured/good versus non-featured ones. If articles have similar lengths, we achieve an F-measure of 86.7%, and 84% otherwise.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval - Information filtering

1. INTRODUCTION
People use the Web as a basis for their decisions and beliefs. Because quality control is lacking, Web-based information sources often contain inaccurate and false information.
Thus, in addition to the content itself, measures are needed to capture credibility and quality aspects. In this work, we propose a statistical quality measure called factual density, which assesses the quality of content with respect to facts. We define the factual density of a document as the number of facts found in this document in relation to the document’s length. Consequently, factual density indicates a document’s informativeness. We also propose to use binary relations, i.e., triples of the form (argument1, relation, argument2) [3], as features to distinguish between high-quality factual content and non-factual content. Our hypothesis is that a document’s content is of higher quality if it is both factual and informative.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WebQuality ’12, April 16, 2012, Lyon, France. Copyright 2012 ACM 978-1-4503-1237-0 ...$10.00.

1.1 Related Work
The quality of Web content has mainly been assessed with metrics that capture content quality aspects such as objectivity [6], content maturity, and readability [10]. A key aspect here is to determine an appropriate set of features. In [6], stylometric features are proposed for assessing content quality. Lipka and Stein [7] exploit character trigram distributions to identify high-quality featured/good articles in Wikipedia. Blumenstock [2] suggests simply using word count as an indicator of the quality of Wikipedia articles.

To assess the factual accuracy of Web content, more complex, semantic features are needed.
A common approach is to employ Open Information Extraction [4] or methods that use background knowledge on semantic relations available in ontological resources such as WordNet [5] and YAGO [9]. These approaches extract relational information about entities named in a particular text (e.g., facts like f = (Mozart, was born in, Salzburg)). In addition, they exploit defined semantic relationships such as meronymy and hypernymy, among others, to infer relational information between entities that is not given explicitly in the text. In this work, we refer to such features as relational features.

2. MEASURING THE QUALITY USING FACTUAL INFORMATION
In order to measure information quality based on factual information, we propose three approaches: (i) using simple statistics about the facts obtained from a text, (ii) exploiting relational information contained in facts, and (iii) exploiting semantic relationships like meronymy and hypernymy. In this work, we focus on the first two approaches.

In the first approach, we resort to simple statistical features about facts in order to determine the informativeness of a document. We denote features of this kind as fact frequency-based features.
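
As a minimal sketch of one fact frequency-based feature, the following Python snippet computes factual density as the number of extracted facts in relation to document length. It assumes that the facts have already been produced by an Open Information Extraction system and that length is measured in whitespace-separated tokens; the names `Fact` and `factual_density` are hypothetical and chosen here for illustration only.

```python
from typing import NamedTuple, Sequence


class Fact(NamedTuple):
    """A binary relation of the form (argument1, relation, argument2)."""
    argument1: str
    relation: str
    argument2: str


def factual_density(facts: Sequence[Fact], text: str) -> float:
    """Return the number of facts divided by the document length.

    Assumption: document length is the number of whitespace-separated
    tokens. Other length measures could be substituted.
    """
    length = len(text.split())
    if length == 0:
        # An empty document carries no facts; avoid division by zero.
        return 0.0
    return len(facts) / length


# Usage with the example fact from the text. The fact list is supplied
# by hand here; in practice it would come from an extraction system.
text = "Mozart was born in Salzburg."
facts = [Fact("Mozart", "was born in", "Salzburg")]
density = factual_density(facts, text)
```

Under this token-based length assumption, a document with many extracted facts per token receives a higher score than an equally long document with few facts, which matches the intuition that factual density indicates informativeness.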