© 2018 Björn Barz, Thomas C. van Dijk, Bert Spaan, Joachim Denzler. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version was published in the Proceedings of the 2nd ACM SIGSPATIAL Workshop on Geospatial Humanities, https://doi.org/10.1145/3282933.3282937.

Putting User Reputation on the Map: Unsupervised Quality Control for Crowdsourced Historical Data

Björn Barz, Lehrstuhl für Digitale Bildverarbeitung, Friedrich Schiller University Jena, bjoern.barz@uni-jena.de
Thomas C. van Dijk, Lehrstuhl für Informatik I, Würzburg University, thomas.van.dijk@uni-wuerzburg.de
Bert Spaan, Independent Map and Data Engineer, Amsterdam, bertspaan.nl
Joachim Denzler, Lehrstuhl für Digitale Bildverarbeitung, Friedrich Schiller University Jena, joachim.denzler@uni-jena.de

ABSTRACT
In this paper we propose a novel method for quality assessment of crowdsourced data. It computes user reputation scores without requiring ground truth; instead, it is based on the consistency among users. In this pilot study, we perform explorative data analysis on two real crowdsourcing projects by the New York Public Library: extracting building footprints as polygons from historical insurance atlases, and geolocating historical photographs. We show that the computed reputation scores are plausible and furthermore provide insight into user behavior.

CCS CONCEPTS
• Information systems → Geographic information systems; Crowdsourcing; Reputation systems.

KEYWORDS
historical data, crowdsourcing, quality control, user reputation

ACM Reference Format:
Björn Barz, Thomas C. van Dijk, Bert Spaan, and Joachim Denzler. 2018. Putting User Reputation on the Map: Unsupervised Quality Control for Crowdsourced Historical Data. In 2nd ACM SIGSPATIAL Workshop on Geospatial Humanities (GeoHumanities'18), November 6, 2018, Seattle, WA, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3282933.3282937

1 INTRODUCTION AND RELATED WORK
The use of crowdsourcing provides a big opportunity for the digital humanities in general, and the mapping of historical documents specifically. However, it comes with major concerns about data quality; see, for example, a recent survey by Daniel et al. [4]. In this paper, we propose a novel method for quality assessment based on a notion of consistency: good users are likely to give answers that are similar to the answers of other good users. This apparently circular definition is resolved by finding a stationary vector of reputations, similar to the popular PageRank algorithm [11].

This approach allows us to determine a user reputation based on work items for which a sufficient number of different users have submitted an answer. We can subsequently use this knowledge about a user's performance to more accurately assess the quality of answers for work items for which only a few users have provided an answer. This enables a kind of smart crowdsourcing, where algorithms are used to increase the value of volunteered information.
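To make the fixed-point idea concrete, the following sketch shows how such a stationary reputation vector can be computed by a PageRank-style power iteration over a matrix of pairwise user agreement. This is illustrative only: the agreement matrix, the damping parameter, and the function name reputation_scores are assumptions for this sketch, not part of the method as published; the concrete agreement measure between two users' answers is task-specific.

```python
import numpy as np

def reputation_scores(agreement, damping=0.85, tol=1e-9, max_iter=1000):
    """Power iteration for consistency-based user reputations.

    agreement[i, j] is a non-negative score measuring how well user j's
    answers agree with user i's answers on work items they both handled.
    (The concrete agreement measure is task-specific and assumed here.)
    """
    n = agreement.shape[0]
    # Normalize columns so each user distributes one unit of "endorsement".
    col_sums = agreement.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0          # isolated users endorse nobody
    transition = agreement / col_sums

    reputation = np.full(n, 1.0 / n)       # start from uniform reputation
    for _ in range(max_iter):
        updated = damping * transition @ reputation + (1.0 - damping) / n
        if np.abs(updated - reputation).sum() < tol:
            break
        reputation = updated
    return reputation

# Toy example: users 0 and 1 agree strongly with each other,
# user 2 agrees with nobody and should receive a low reputation.
agreement = np.array([[0.0, 0.9, 0.1],
                      [0.9, 0.0, 0.1],
                      [0.1, 0.1, 0.0]])
print(reputation_scores(agreement))
```

As in PageRank, a damping factor below 1 makes the update a contraction, so the iteration converges to a unique stationary vector regardless of initialization.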
In the taxonomy of Daniel et al., our method is a computation-based assessment. Our conceptual starting point is close to their concept of “output agreement,” but we take it in a different direction. For example, various authors have considered the game-theoretic aspects of two crowd workers answering the same question [8, 14]. Instead, we take a global view of the whole dataset and perform a post-hoc analysis.

Rajasekharan et al. [12] and Faisal et al. [5] use similar ideas in the context of open-source programming forums. However, they rely on direct interactions between users, such as comments or up- and down-votes. This is a property of many existing user reputation systems: they employ some kind of control instance evaluating the crowd workers. This control instance could either come from external experts or from the workers themselves, who judge the work of others (e.g., [1]). However, finding and remunerating experts can be difficult in settings requiring highly specific domain knowledge. On the other hand, allowing the crowd to evaluate itself opens up a variety of possibilities for manipulation [13].

In contrast, we follow a content-driven approach for evaluating the work of each user without any additional input about their performance. Instead, it is based solely on the agreement between users. As a result, our method is more generally suited to the many kinds of crowdsourced historical data.

Similar in spirit but methodologically entirely different is the STAPLE algorithm for crowdsourced image segmentation [15]. It computes consensus and user performance ratings simultaneously using the expectation-maximization algorithm: first, a probabilistic consensus is approximated, and then the performance of the annotators is assessed by comparing them to this consensus. A new consensus is then estimated based on the user performance computed in the previous step. These two alternating steps are iterated until convergence. This approach provides not only user ratings but also a consensus. However, it also requires the definition of a suitable