Characterizing Wikipedia Pages Using Edit Network Motif Profiles

Guangyu Wu, Martin Harrigan, Pádraig Cunningham
Clique Research Cluster
School of Computer Science & Informatics
University College Dublin, Ireland
{guangyu.wu, martin.harrigan, padraig.cunningham}@ucd.ie

ABSTRACT
Good Wikipedia articles are authoritative sources because a number of knowledgeable contributors have collaborated to produce an authoritative article on a topic. This is the many eyes idea. The edit network associated with a Wikipedia article can tell us something about the quality or authoritativeness of a Wikipedia article. In this paper we explore the hypothesis that the characteristics of this edit network are predictive of the quality of the content. We characterize the edit network using the profile of network motifs it contains and we show that this network motif profile is predictive of the Wikipedia quality class – articles in well-curated parts of Wikipedia are assigned a quality category by editors. We further show that the network motif profile can identify outlier articles particularly in the Featured Article class, the highest Wikipedia quality class.

Categories and Subject Descriptors
H.5.0 [Information Interfaces and Presentation]: General; H.2.8 [Database Management]: Data Mining

1. INTRODUCTION
While the impact and widespread uptake of Wikipedia is undeniable, the reliability of much of the content available on Wikipedia has been the subject of intense debate. Perhaps the epicentre of this debate has been the Nature article favorably comparing the quality of Wikipedia entries to those in the Encyclopædia Britannica [6] and the robust response to this article from the Britannica owners¹. The truth is that sometimes Wikipedia is effective at harnessing the wisdom of crowds to produce excellent authoritative articles and sometimes Wikipedia articles are quite poor – often because they are not the result of much collaboration.
¹ Fatally Flawed: Refuting the Recent Study on Encyclopedic Accuracy by the Journal Nature, http://corporate.britannica.com/britannica_nature_response.pdf

In this paper we set out to examine to what extent the quality of a Wikipedia article can be determined from looking at the network of edits around an article. We explore the hypothesis that a characterization of the edit network in terms of network motif profiles will be predictive of the quality of the article. The network motif profile comprises a count of the occurrences of the 17 network motifs shown in Figure 2.

Two example edit networks are shown in Figure 1. Since these networks are centered on a specific article they can be considered ego networks for that article. Both ego networks are from the article covering the 2011 Japanese earthquake and tsunami². The network on the left is from two days after the disaster and the one on the right is from ten days after. The article was rated C class on the Wikipedia quality scale (see section 4) at the time of the network on the left, and its rating had risen to B class after ten days. A characteristic of the C class article is that (at least at that stage) the contributors had not collaborated on other Wikipedia articles.

While the majority of Wikipedia articles carry these quality class markers, there is a vast array of Wikipedia content that is not rated and much of it is of dubious or poor quality. This is a problem in Wikipedia articles on entertainment, sport and popular culture but it is even a problem in science articles where Wikipedia normally works well. Two examples relevant for this paper are the articles on network motifs and the k-nearest neighbor algorithm. At the time of writing, these articles are not necessarily inaccurate but they are not comprehensive or authoritative. It is clear from an examination of the edit history of these articles that they have not received much attention from Wikipedia contributors.
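To make the idea of a network motif profile concrete, the following is a minimal sketch of how such a profile could be computed for a small ego network. It is an illustration only and not the procedure used in this paper: the 17 motifs of Figure 2 are not reproduced here, so the sketch counts three simple undirected patterns (edges, open triads and triangles) as stand-ins. The function name `motif_profile`, the node labels and the toy edge list are all hypothetical.

```python
from itertools import combinations


def motif_profile(edges):
    """Count a few small undirected motifs in a graph given as an edge list.

    Assumption: the three patterns below are illustrative stand-ins,
    not the 17 motifs referred to in the text.
    """
    # Build an adjacency map, ignoring self-loops and duplicate edges.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2

    triangle_corners = 0  # each triangle is seen once from each of its 3 nodes
    open_triads = 0       # paths a - node - b where a and b are not linked
    for node, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                triangle_corners += 1
            else:
                open_triads += 1

    return {
        "edge": n_edges,
        "open_triad": open_triads,
        "triangle": triangle_corners // 3,
    }


# Hypothetical toy ego network: one central node linked to three others,
# two of which are also linked to each other.
toy_edges = [
    ("article", "e1"),
    ("article", "e2"),
    ("article", "e3"),
    ("e1", "e2"),
]
profile = motif_profile(toy_edges)
print(profile)
```

The resulting dictionary plays the role of a feature vector: each entry is the count of one pattern, so two networks can be compared by comparing their counts.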
In the next section we provide a brief review of some relevant research on social network analysis and authoritativeness. In section 3 we outline the procedures we employ to represent Wikipedia articles in terms of edit networks. Section 4 provides some background on the Wikipedia quality scale and describes the datasets we use in our study. The usefulness of this network motif-based representation for classification is evaluated in section 5 and an assessment of the discriminating power of the different network motifs is presented in section 6. We visualize the data using 2-D spatializations in section 7. The paper concludes in section 8 with a summary and an outline of future work.

² The Tōhoku earthquake and tsunami: http://bit.ly/grGP84