Semantic Annotation in the Project “Open Access Database ‘Adjective-Adverb Interfaces’ in Romance”

Christopher Pollin, Gerlinde Schneider, Katharina Gerhalter, Martin Hummel
Centre for Information Modeling & Institute for Romance Studies, University of Graz
Elisabethstraße 59/III, 8010 Graz, Merangasse 70, 8010 Graz
{christopher.pollin, gerlinde.schneider, katharina.gerhalter, martin.hummel}@uni-graz.at

Abstract
This paper describes the creation, the annotation process and the model of the Open Access Database ‘Adjective-Adverb Interfaces in Romance’ (AAIF) project, with its approach to the creation of a domain-specific ontology. In order to make research data accessible, interoperable, extensible, and transferable, data is annotated in TEI/XML and formalized and enriched with RDF, and its conceptual data model is stored in and published via the GAMS digital repository. This produces semantically enriched, annotated multilingual research data that allows retrieval across heterogeneous corpora. The annotation model expressed in the ontology is offered for further reuse.

Keywords: annotated data, open access, semantic enrichment, ontology-based, RDF, TEI, GAMS

1. Introduction
Annotation has always played a crucial role in humanities textual scholarship as well as in linguistic research, and its importance has increased with the development of digital methods and tools. For this reason, research data in these areas very often consist of annotated text in various forms. The taxonomy TaDiRAH¹ describes the digital research practice of annotating as the ‘activity of making information about a digital object explicit by adding, e.g., comments, metadata or keywords [...]’.

Schöch (2013) distinguishes between two types of data in the context of research in the humanities: big data and smart data. The former is unstructured, implicit, large in volume, and varied in form. The latter is semi-structured or structured, explicit, small in scale and of limited heterogeneity.
According to these criteria, annotated linguistic corpora are smart data. The data that the project Open Access Database ‘Adjective-Adverb Interfaces in Romance’ (AAIF)² deals with are complex linguistic annotations. The project aims to survey the possibilities and challenges of open data and open access with regard to linguistic research data. It focuses on the interoperability and accessibility of data, with particular respect to reusability in the sense of the FAIR³ Data Principles. Topics discussed in this paper include data creation, annotation, the data preservation and publication process by means of the GAMS⁴ repository, and accessibility via a search interface. These aspects are tied together by semantic technologies, using an ontology-based approach that is relevant to other domains of digital data. In the following, we investigate the application of semantic technologies to meet the challenges described above.

¹ Taxonomy of Digital Research Activities in the Humanities, http://tadirah.dariah.eu/vocab
² https://adjective-adverb.uni-graz.at/en/research/projects/open-access-database
³ https://www.force11.org/group/fairgroup/fairprinciples
⁴ http://gams.uni-graz.at
⁵ https://adjective-adverb.uni-graz.at

2. Project and Challenges
Funding authority policy, as well as a re-thinking in research communities, has led to a situation where more and more richly annotated research data is becoming openly accessible and integrable. AAIF, a project within the Austrian Science Fund programme Open Research Data Pilot, focuses on how to publish linguistically annotated data so that it is reusable within and outside of the domain, while also making the underlying annotation model available. The project builds upon the work of the Research group on Interfaces of Adjective and Adverb in Romance.
⁵ In the course of the project, different corpora, each annotated with respect to the complex relations between the word classes of adjective and adverb in Romance languages, will be integrated into one comprehensive database. This will enable querying across corpora and languages and thus allow for cross-linguistic generalizations. The expandability of the system for new data has to be considered throughout the whole process.

As the corpora were compiled and annotated in response to diverse, very specific research questions within the domain, the degree and emphasis of the annotation vary. Adjective-adverb phrases can have a very flat annotation, where, for example, only one adverb and one verb are marked and lemmatized; other phrases are annotated very extensively with semantic and morphosyntactic information. Additionally, the applied annotation model has been developed further over time. With more diverse research questions and a deeper understanding of the field, some categories were added and others changed. All this results in data that is annotated very heterogeneously and will remain so in the future. These issues significantly complicate the endeavor of integrating all data into one database while concurrently preserving the rich annotation each corpus holds.
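To make the heterogeneity problem concrete, the following minimal Python sketch models flat and richly annotated phrases in one record type and queries across both. It is an illustration under stated assumptions, not the project's data model: the class `Annotation`, its field names, the corpus labels, the feature key `inflection`, and the sample phrases are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    """One annotated adjective-adverb phrase (hypothetical, simplified model)."""
    corpus: str
    language: str
    adverb_lemma: str
    verb_lemma: str
    # Optional layers that only richly annotated corpora provide.
    semantics: Optional[str] = None
    morphosyntax: dict = field(default_factory=dict)


def find_by_adverb(annotations, adverb_lemma):
    """Return all phrases with the given adverb lemma, whatever their depth."""
    return [a for a in annotations if a.adverb_lemma == adverb_lemma]


def has_feature(annotation, key, value):
    """True only if the phrase carries the feature; flat records never match."""
    return annotation.morphosyntax.get(key) == value


# Illustrative sample data (invented for this sketch).
annotations = [
    # Flat record: only adverb and verb are marked and lemmatized.
    Annotation("corpus_a", "es", "rápido", "hablar"),
    # Rich records: additional semantic and morphosyntactic information.
    Annotation("corpus_b", "fr", "fort", "parler",
               semantics="manner",
               morphosyntax={"inflection": "invariable"}),
    Annotation("corpus_b", "es", "rápido", "correr",
               semantics="manner",
               morphosyntax={"inflection": "invariable"}),
]

# A query on the shared minimal layer reaches every corpus.
hits = find_by_adverb(annotations, "rápido")

# A query on an optional layer silently skips flat records.
rich_hits = [a for a in hits if has_feature(a, "inflection", "invariable")]
```

The sketch shows the trade-off described above: queries on the layer shared by all corpora work across the whole collection, whereas queries on optional layers return results only from the corpora that were annotated in that depth, so an integrated database has to make the available annotation depth of each record explicit.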