EXCITE – A toolchain to extract, match and publish open literature references

Azam Hosseini
GESIS – Leibniz Institute for the Social Sciences
azam.hosseini@gesis.org

Behnam Ghavimi
GESIS – Leibniz Institute for the Social Sciences
behnam.ghavimi@gesis.org

Zeyd Boukhers
University of Koblenz-Landau
boukhers@uni-koblenz.de

Philipp Mayr
GESIS – Leibniz Institute for the Social Sciences
philipp.mayr@gesis.org

ABSTRACT
This demo paper presents a generic toolchain to extract, segment and match literature references from full-text PDF files in the EXCITE project. The aim of EXCITE is to extract and match citations from social science publications and to make more citation data available to researchers. Each step in the EXCITE pipeline and the open source tools used to accomplish it are explained. The public demo system, which integrates all components of the toolchain under a user-friendly interface, is presented and illustrated. As a final step, a special component is introduced which is capable of ingesting the extracted and matched references into the Open Citation Corpus.

KEYWORDS
Reference Extraction, Reference Matching, Open Citations, Demo

ACM Reference Format:
Azam Hosseini, Behnam Ghavimi, Zeyd Boukhers, and Philipp Mayr. 2019. EXCITE – A toolchain to extract, match and publish open literature references. In Proceedings of JCDL 2019: Joint Conference on Digital Libraries (JCDL 2019). ACM, New York, NY, USA, 2 pages.

1 INTRODUCTION
Despite the widely acknowledged benefits of citation data, open access to references and citations is still insufficient. Some commercial companies such as Clarivate Analytics, Elsevier or Google possess citation data at large scale and use them to provide services for their users. On the other hand, the shortage of citation data for the international and German social sciences is well known to researchers in the field and has itself often been the subject of academic studies [5].
The accessibility of information in the social sciences lags behind other fields (e.g. the natural sciences) where more citation data is available. Recently, some initiatives and projects, e.g. the "Open Citations" project or the "Initiative for Open Citations", have focused on publishing citation data openly¹. The "Extraction of Citations from PDF Documents" (EXCITE²) project is one of these projects. The aim of EXCITE is to extract and match citations from social science publications [4] and to make more citation data available to researchers. EXCITE focuses on social science publications in German but introduces a generic toolchain which can be used and trained for any domain. All tools in the EXCITE project are made available to other researchers. This demo paper introduces the EXCITE toolchain.

¹ https://i4oc.org/
² http://excite.west.uni-koblenz.de/website/

Figure 1: An overview of processing steps and tools in the project EXCITE

2 EXCITE TOOLCHAIN
A number of algorithms have been developed in the EXCITE project for extracting references from PDF full texts and matching them against bibliographic databases (see overview in Figure 1). The extraction of references is implemented as a four-step process:
(1) Extraction of text from PDF files by CERMINE³,
(2) Identification of reference strings and segmentation of references into their constituent fields such as author, title, etc. by Exparser [1]⁴,
(3) Matching of references against bibliographic databases by EXmatcher [2]⁵,
(4) Export and publication of references in reusable formats by converting the generated reference information to JSON using the OCC ontology [6].

For the matching task in EXCITE, different target databases are utilized: a) sowiport [3], b) GESIS Search⁶ and c) Crossref⁷. The EXCITE corpus (PDF files to be processed in the EXCITE project) contains SSOAR⁸ documents (approx. 35k), the Springer Online Journals collection (approx.
80k), and sowiport full-text papers (approx. 116k). The extracted citation data from the EXCITE corpus will be integrated into GESIS Search and OCC (OpenCitations⁹ Corpus).

³ https://github.com/CeON/CERMINE
⁴ https://github.com/exciteproject/Exparser
⁵ https://github.com/exciteproject/EXmatcher
⁶ https://search.gesis.org/
⁷ https://search.crossref.org
⁸ https://www.ssoar.info
⁹ http://opencitations.net

The EXCITE toolchain does not depend on any citation style or