On Bayesian Interpretation of Fact-finding in Information Networks

Dong Wang, Tarek Abdelzaher, Hossein Ahmadi, Jeff Pasternack, Dan Roth, Manish Gupta, Jiawei Han, Omid Fatemieh, and Hieu Le
Department of Computer Science, University of Illinois, Urbana, IL 61801

Charu Aggarwal
IBM Research, Yorktown Heights, N.Y. 10598

Abstract—When information sources are unreliable, information networks have been used in the data mining literature to uncover facts from large numbers of complex relations between noisy variables. The approach relies on topology analysis of graphs, where nodes represent pieces of (unreliable) information and links represent abstract relations. Such topology analysis has often been shown empirically to be quite powerful in extracting useful conclusions from large amounts of poor-quality information. However, no systematic analysis has been proposed for quantifying the accuracy of such conclusions. In this paper, we present, for the first time, a Bayesian interpretation of the basic mechanism used in fact-finding from information networks. This interpretation leads to a direct quantification of the accuracy of conclusions obtained from information network analysis. Hence, we provide a general foundation for using information network analysis not only to heuristically extract likely facts, but also to quantify, in an analytically-founded manner, the probability that each fact or source is correct. Such a probability constitutes a measure of quality of information (QoI). The paper thus presents a new foundation for QoI analysis in information networks that is of great value in deriving information from unreliable sources. The framework is applied to a representative fact-finding problem and validated by extensive simulation, which shows significant improvement over past work and close correspondence with ground truth.

Keywords: Information networks, sensors, Bayesian inference.

I. INTRODUCTION

Information networks are a key abstraction in the data mining literature used to uncover facts from a large number of relations between unreliable observations [1]. The power of information network analysis lies in its ability to extract useful conclusions even when the degree of reliability of the input data or observations is not known in advance. For example, given a set of claims from a multitude of sources, one can rank both the claimed information pieces (let us call them assertions) and their sources by credibility, given no a priori knowledge of the truthfulness of the individual assertions and sources. Alternatively, given only data on who publishes in which conferences, one can rank both the authors and the conferences by authority in the field.

This paper presents a new analytic framework that enables, for the first time, the calculation of correct probabilities of conclusions resulting from information network analysis. Such probabilities constitute a measure of quality of information (QoI). Our analysis relies on a Bayesian interpretation of the basic inference mechanism used for fact-finding in the information network literature.

In the simplest version of fact-finding from information networks, nodes represent entities such as sources and assertions. Edges denote their relations (e.g., who claimed what). Each category of nodes is then iteratively ranked. Assertions are given a ranking that is proportional to the number of their sources, each source weighted by its credibility. Sources are then given a ranking that is proportional to the number of assertions they made, each weighted by its credibility. This iterative ranking process continues until it converges; a minimal sketch of the loop is shown below.
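For concreteness, the following sketch implements this basic iterative credibility ranking over a bipartite source/assertion graph. The data layout, the max-normalization, and the convergence test are illustrative assumptions of ours, not the exact algorithm analyzed later in the paper.

    # Minimal sketch of iterative fact-finding over (source, assertion) claims.
    # Normalization and convergence criteria are illustrative choices.
    def rank_sources_and_assertions(claims, num_iters=100, tol=1e-6):
        """claims: list of (source, assertion) pairs, i.e. 'who claimed what'."""
        claims = list(claims)
        sources = {s for s, _ in claims}
        assertions = {a for _, a in claims}
        src_cred = {s: 1.0 for s in sources}      # initial source credibility
        asr_cred = {a: 1.0 for a in assertions}   # initial assertion credibility

        for _ in range(num_iters):
            # Assertion score: sum of credibilities of the sources claiming it.
            new_asr = {a: 0.0 for a in assertions}
            for s, a in claims:
                new_asr[a] += src_cred[s]
            # Source score: sum of credibilities of the assertions it made.
            new_src = {s: 0.0 for s in sources}
            for s, a in claims:
                new_src[s] += new_asr[a]
            # Normalize so the scores stay bounded across iterations.
            asr_norm = max(new_asr.values(), default=1.0) or 1.0
            src_norm = max(new_src.values(), default=1.0) or 1.0
            new_asr = {a: v / asr_norm for a, v in new_asr.items()}
            new_src = {s: v / src_norm for s, v in new_src.items()}
            # Stop once the source scores stop changing.
            delta = max((abs(new_src[s] - src_cred[s]) for s in sources), default=0.0)
            src_cred, asr_cred = new_src, new_asr
            if delta < tol:
                break

        return src_cred, asr_cred

The returned values are relative ranking scores, not probabilities; quantifying the latter is precisely the gap this paper addresses.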
Information network analysis is good at such ranking. However, while these algorithms compute an intuitive "credibility score", as we demonstrate in this paper, they do not actually compute the real probability that a particular conclusion is true. For example, given that some source is ranked 17th by credibility, it is not clear what that means in terms of the probability that the source tells the truth. Our paper addresses this problem, providing a general analytic foundation for quantifying the probability of correctness in the fact-finding literature. We show that the probabilities computed using our analysis are significantly more accurate than those of prior work.

The fact-finding techniques addressed in this paper are particularly useful in environments where a large number of sources are used whose reliability is not a priori known (as opposed to collecting information from a small number of well-characterized sources). Such situations are common when, for instance, crowd-sourcing is used to obtain information, or when information is to be gleaned from informal sources such as Twitter messages. We focus on networks of sources and assertions. The Bayesian interpretation derived in this paper allows us to accurately quantify the probability that a source is truthful or that an assertion is true in the absence of detailed prior knowledge. Note that, while only source/assertion networks are considered, the analysis allows us to represent a much broader category of information networks. For example, in the author/conference network, one can interpret the act of publishing in a conference as an implicit assertion that the conference is good. The credibility of the assertion depends on the authority of the author. Hence, the network fits the source/assertion model.

This paper is intended to be a first step towards a new category of information network analysis. Being the first step,