How Do Metrics of Link Analysis Correlate to Quality, Relevance and Popularity in Wikipedia?

Raiza Tamae Sarkis Hanada, Maria da Graça Campos Pimentel
ICMC – Universidade de São Paulo
{rhanada, mgp}@icmc.usp.br

Marco Cristo
Universidade Federal do Amazonas, Instituto de Computação
marco.cristo@icomp.ufam.edu.br

ABSTRACT
Many links between Web pages can be viewed as indicative of the quality and importance of the pages they point to. Accordingly, several studies have proposed link-based metrics to infer Web page content quality. However, as far as we know, the only work that has examined the correlation between such metrics and content quality consisted of a limited study that left many open questions. Although these metrics have proved successful at ranking pages returned as answers to queries submitted to search engines, it is not possible to determine the specific contribution that factors such as quality, popularity, and importance make to the results. This difficulty is partially due to the fact that such information about Web pages in general is hard to obtain. Unlike ordinary Web pages, the quality, importance and popularity of Wikipedia articles are evaluated by human experts or can be easily estimated. Thus, it is feasible to verify the relationship between link analysis metrics and the aforementioned factors in Wikipedia articles. This is our goal in this work. To accomplish it, we implemented several link analysis algorithms and compared their resulting rankings with the ones created by human evaluators regarding factors such as quality, popularity and importance. We found that the metrics are more correlated to quality and popularity than to importance, and that the correlations are moderate.

Categories and Subject Descriptors: H.3 [INFORMATION STORAGE AND RETRIEVAL]. H.3.m [Miscellaneous]
General Terms: Measurement.
Keywords: Link Analysis, Wikipedia, Quality of Content.

ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. WebMedia’13, November 5–8, 2013, Salvador/BA, Brazil. Copyright 2013 ACM 978-1-4503-1706-1/12/10.

1. INTRODUCTION
As a result of its large and fast expansion, the Web has become a huge collection of decentralized and disorganized documents. Nevertheless, many companies have successfully developed tools to find information on the Web. Such tools are the search engines, which allow users to express their information needs by means of queries and find relevant information. To accomplish this task, ranking algorithms sort pages according to their scores. These scores are estimated based on various pieces of evidence. Among them, those based on the Web link structure play an important role. For these, the Web is modeled as a directed graph where nodes represent pages and edges represent links. The study of the properties and relations that can be inferred from this graph is called Link Analysis.

According to a common intuition in Link Analysis, when the author of a page creates a link to another page, he endorses the content of the page he is pointing to, suggesting that it is of good quality and can be considered an authority on a particular subject [12]. In this work, we refer to this intuition as the “quality hypothesis”. According to this hypothesis, the content quality of a page p can be inferred from which and how many pages cite p, or equivalently, from the reputation of p. This assumption contributes to the idea that the more a page is pointed to by other pages, the higher the probability that it is preferred by a user among several alternative pages on the same topic.
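To make the graph model and the quality hypothesis concrete, the sketch below scores the pages of a small link graph in two ways: by counting how many pages cite each page (in-degree) and by weighting each citation by the score of the citing page (a PageRank-style power iteration). Both are illustrative choices and are not claimed to be the exact metrics evaluated in this work. The toy edge list, the function names, and the parameter values (`damping=0.85`, `iterations=50`) are assumptions made only for this example.

```python
from collections import defaultdict


def in_degree(edges):
    """Count how many distinct pages link to each page ("how many pages cite p")."""
    citing = defaultdict(set)
    pages = set()
    for src, dst in edges:
        pages.update((src, dst))
        if src != dst:  # ignore self-links
            citing[dst].add(src)
    return {p: len(citing[p]) for p in pages}


def pagerank(edges, damping=0.85, iterations=50):
    """PageRank-style score: a citation counts more when the citing page scores high."""
    pages = sorted({p for edge in edges for p in edge})
    out_links = {p: set() for p in pages}
    for src, dst in edges:
        if src != dst:
            out_links[src].add(dst)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Score held by pages without out-links is spread uniformly over all pages.
        dangling = sum(rank[p] for p in pages if not out_links[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for q in out_links[p]:
                new_rank[q] += damping * rank[p] / len(out_links[p])
        rank = new_rank
    return rank


# Hypothetical toy graph: each pair (a, b) means "page a links to page b".
EDGES = [("A", "C"), ("B", "C"), ("D", "C"), ("C", "A"), ("D", "B")]

if __name__ == "__main__":
    print(in_degree(EDGES))
    print(pagerank(EDGES))
```

In this toy graph, pages A and B have the same in-degree, but A is cited by the highly cited page C, so the PageRank-style score separates them. This reflects the "which pages cite p" part of the hypothesis, as opposed to only "how many".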
As far as ranking the results of a query is concerned, the literature reports the use of several metrics based on Link Analysis [2]. The results of these metrics are usually evaluated by human experts to check how relevant the returned pages are to the queries provided. Although these evaluations confirm that Link Analysis is useful to rank pages according to their relevance, they do not confirm that the “quality hypothesis” holds. The difficulty lies in the fact that the degree of quality of a Web page is very hard to obtain. As a result, as far as we know, this hypothesis has not been fully verified.

Confirming this hypothesis is even harder when we observe that quality is not the only factor that contributes to a page being cited. Other factors investigated in the literature are the importance and the popularity of the page [6]. These factors, besides quality, are central concepts in this work and deserve a more careful discussion. Given a set of alternative options, the term quality refers to a level of excellence that, when assigned to a specific option, allows us to distinguish it from the others. However, quality perception is subjective, which implies that different users may assign different degrees of quality to the same page. In this sense, it would be more correct to say that, in addition to good quality, highly cited pages also have good reputation. Reputation refers to the opinion that people have about something. A concept related to reputation is popularity.