The CEDAR Project: Publishing and Consuming Harmonized Census Data

Albert Meroño-Peñuela¹, Ashkan Ashkpour², Christophe Guéret¹ and Andrea Scharnhorst¹

¹ Data Archiving and Networked Services (DANS), Den Haag, NL
² International Institute of Social History, Amsterdam, NL

{albert.merono, christophe.gueret, andrea.scharnhorst}@dans.knaw.nl
ashkan.ashkpour@iisg.nl

Abstract. This paper discusses the use of semantic technologies to increase the quality, machine-processability, format translatability and cross-querying of the complex tabular datasets found in many research areas of the Humanities. In particular, we are interested in enabling longitudinal studies of social processes in the past. We use the historical Dutch censuses as a case study: census data is currently digitized, but it is notoriously difficult to compare, aggregate and query in a uniform fashion. We describe an approach to achieve these goals, emphasizing open problems and trade-offs.

1. Introduction

Census data plays an invaluable role in the historical study of society. In the Netherlands, the Dutch historical censuses (1795-1971) are among the most frequently consulted statistical sources of the Central Bureau of Statistics¹ [1]. For the period they cover, they are the only regular statistical population study performed by the Dutch government, and the only historical data on population characteristics that is not strongly distorted.

Over the past decades, many digitization efforts across the world have brought (historical) census data to researchers. Historical censuses comprise very detailed data about specific categories and variables in a given period of history, which is why historians are greatly interested in their digitization. The censuses are a rich source of information on demographic, social and economic structures, yielding a wealth of data on many issues over time [4].

The dataset comprises 507 Excel workbooks with 2,288 tables. It is a digital representation of a subset of the original census books, which contain only aggregated data. These books have been converted into several digital formats in successive stages: first as scanned images, then as PDF documents, and finally as Excel tabular files. The dataset is archived in EASY², the archiving system of DANS, and is publicly available as open data at www.volkstellingen.nl.

¹ http://www.cbs.nl
² http://easy.dans.knaw.nl