Annotated Bibliographical Reference Corpora in Digital Humanities

Young-Min Kim
LIA, University of Avignon, 84911 Avignon, France
young-min.kim@univ-avignon.fr

Patrice Bellot
LSIS, Aix-Marseille University, 13397 Marseille, France
patrice.bellot@lsis.org

Elodie Faath, Marin Dacos
CLEO, Centre for Open Electronic Publishing, 13331 Marseille, France
{elodie.faath, marin.dacos}@revues.org

Abstract
In this paper, we present new bibliographical reference corpora in digital humanities (DH) that have been developed under a research project, "Robust and Language Independent Machine Learning Approaches for Automatic Annotation of Bibliographical References in DH Books", supported by Google Digital Humanities Research Awards. The main target is the bibliographical references in the articles of the Revues.org site, the oldest French online journal platform in the DH field. Since the final objective is to provide automatic links between related references and articles, the automatic recognition of reference fields such as author and title is essential. These fields are therefore manually annotated using a set of carefully defined tags. After providing a full description of the three corpora, which are constructed separately according to the difficulty level of annotation, we briefly introduce our experimental results on the first two corpora. A popular machine learning technique, Conditional Random Fields (CRFs), is used to build a model that automatically annotates the fields of new references. In the experiments, we first establish a standard for defining features and labels adapted to our DH reference data. We then show that our new methodology for less structured references gives meaningful results.

Keywords: Bibliographical reference, Automatic annotation, Digital Humanities, Bilbo, Conditional Random Field, TEI

1. Introduction
In this paper, we present new bibliographical reference corpora in the digital humanities area.
The corpora have been developed under a research project, "Robust and Language Independent Machine Learning Approaches for Automatic Annotation of Bibliographical References in DH (Digital Humanities) Books", supported by Google Digital Humanities Research Awards. It is an R&D program for in-text bibliographical references published on CLEO's OpenEdition platforms (http://www.openedition.org) for electronic articles, books, scholarly blogs and resources in the humanities and social sciences. The program aims to construct a software environment enabling the recognition and automatic structuring of references in academic digital documentation, whatever their bibliographic styles (Kim et al., 2011).

Most earlier studies on bibliographical reference annotation target the bibliography at the end of scientific articles, which has a simple structure and a relatively regular format for the different fields. Some methods employ machine learning and numerical approaches, as opposed to symbolic ones, which require a large set of rules that can be very hard to manage and are not language independent. Day et al. (2005) cite the works of a) Giles et al. (1998) for the CiteSeer system on computer science literature, which achieves 80% accuracy for author detection and 40% accuracy for page numbers (1997-1999), b) Seymore et al. (1999), who employ Hidden Markov Models (HMMs), which learn generative models over pairs of input and label sequences, to extract fields from the headers of computer science papers, and c) Peng and McCallum (2006), who use Conditional Random Fields (CRFs) (Lafferty et al., 2001) for labeling and extracting fields from research paper headers and citations. Other approaches employ discriminatively trained classifiers such as Support Vector Machine (SVM) classifiers (Joachims, 1999). Compared to HMM and SVM, CRF obtained better labeling performance.
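To make the sequence-labeling view of reference annotation concrete, the following minimal sketch shows how a reference string might be turned into per-token feature dictionaries of the kind a CRF toolkit typically consumes. The tokenizer, the feature names, the sample reference string and the label set (`author`, `title`, `date`, `punct`) are all illustrative assumptions; they are not the features or tags used in the project described here.

```python
import re


def tokenize(reference):
    """Split a reference string into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", reference)


def token_features(tokens, i):
    """Build a hypothetical feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_digits": tok.isdigit(),
        # Rough year pattern (1500-2099); an assumption for illustration only.
        "looks_like_year": bool(re.fullmatch(r"(1[5-9]|20)\d{2}", tok)),
        "is_punct": not any(ch.isalnum() for ch in tok),
        # Neighbouring tokens give the model local context.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def reference_to_features(reference):
    """Return the tokens of a reference and one feature dict per token."""
    tokens = tokenize(reference)
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


# Made-up reference string; the labels are hand-assigned, one per token.
tokens, feats = reference_to_features(
    "Lafferty, J., 2001. Conditional random fields."
)
labels = ["author", "punct", "author", "punct", "punct",
          "date", "punct", "title", "title", "title", "punct"]
```

A sequence model such as a CRF would be trained on many such (feature sequence, label sequence) pairs and would then predict the label sequence for the tokens of an unseen reference.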
The main interest of our project is to provide automatic links between related references, articles and resources on the OpenEdition site, which is composed of three different sub-platforms: Revues.org, Hypotheses.org and Calenda. Automatic link creation essentially involves the automatic recognition of reference fields such as author, title and date. Once the fields are correctly separated and recognized, different techniques can be applied to create cross-links. The initial work of this project mainly consists of corpus construction, especially the manual annotation of reference fields, which requires a detailed analysis of the target data in OpenEdition. We start with the Revues.org journal platform because it has the most abundant resources in terms of bibliographic references. Faced with the great variety of bibliographical styles present on the three platforms and the dissemination of references within texts, we have implemented a series of stages corresponding to the various issues encountered on the platforms. In this paper, we first detail the nature of the Revues.org data that justifies our methodology, then describe the corpus construction process, and finally discuss the experimental results. In brief, we construct three different types of corpus with detailed manual annotation following the TEI guidelines. They will be a valuable new resource for research activities in natural language processing. There is no equivalent resource to date, in either size or diversity.

2. Revues.org document properties
Revues.org is the oldest French platform of online academic journals. It now offers more than 300 journals in all disciplines of the humanities and social sciences, with a predominance of history, anthropology and sociol-