Evaluating and Integrating Databases in the Area of NLP

Wahed Hemati, Alexander Mehler, Tolga Uslu, Daniel Baumartz, Giuseppe Abrami
Goethe University Frankfurt, Frankfurt, Germany
{hemati, mehler, uslu, baumartz, abrami}@em.uni-frankfurt.de

Abstract
As computational power increases rapidly, the analysis of big data is becoming ever more widespread. This is exemplified by word embeddings, which produce huge index files of interrelated items. Another example is given by digital editions of corpora representing data on nested levels of text structuring. A third example relates to annotations of multimodal communication, comprising nested and networked data of various (e.g., gestural or linguistic) modes. While the first example relates to graph-based models, the second requires document models in the tradition of TEI, whereas the third combines both. A central question is how to store and process such big and diverse data so as to support NLP and related routines efficiently. In this paper, we evaluate six Database Management Systems (DBMS) as candidates for answering this question. We do so by examining database operations in the context of six NLP routines. We show that none of the DBMS consistently works best. Rather, a family of them, manifesting different database paradigms, is required to cope with the need of processing big and divergent data. To this end, the paper introduces a web-based multi-database management system (MDBMS) as an interface to varieties of such databases.

Keywords: Multi-Database Systems, Big Humanities Data, Natural Language Processing, Database Evaluation

1. Introduction
Digital humanities and related disciplines deal with a variety of data ranging from large index files (as in the case of word embeddings (Mikolov et al., 2013)) to inclusion hierarchies as mapped by document models in the tradition of TEI (Burnard and Bauman, 2007), and to models of nested and networked data as exemplified by multimodal communication (Carletta et al., 2003). With the availability of web-based resources, we observe both an increase in the size of such data (that is, in terms of big data) and the need to cope with this diversity simultaneously within the same project: lexica (represented as networks) are interrelated, for example, with large corpora of natural language texts (represented as tree-like structures) in order to observe language change (Kim et al., 2014; Michel et al., 2011; Eger and Mehler, 2016). Another example is given by digital editions (Kuczera, ) in which annotations of textual data are interrelated with annotations of pictorial representations of the underlying sources (Leydier et al., 2014; Lavrentiev et al., 2015). We may also think of text mining in the area of learning analytics based on big learning data, in which textual manifestations of task descriptions are interrelated with student assessments and qualitative data (Robinson et al., 2016; Crossley et al., 2015). To cope with such diversity, databases are required that allow for efficiently storing, retrieving, and manipulating data ranging from distributions to tree-like structures and graphs of symbolic and numerical data. As a matter of fact, such an all-purpose database is still out of reach. This raises the question of which existing database technology, based on which paradigm, performs best in serving these tasks, and, alternatively, which combination of such technologies best fits them. In this paper, we evaluate six Database Management Systems (DBMS) as candidates for answering these two questions.
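To make the diversity at issue concrete, the three data shapes mentioned above can be sketched as minimal Python structures. The sketch is purely illustrative (the names and values are invented for this example, not taken from any of the cited resources):

```python
# Illustrative only: three data shapes handled in digital humanities projects.

# (1) Word embeddings: a large index mapping terms to dense vectors.
embedding_index = {
    "language": [0.12, -0.48, 0.33],
    "corpus":   [0.07,  0.91, -0.20],
}

# (2) TEI-style digital editions: nested inclusion hierarchies
#     (text > body > div > p), i.e. document-shaped data.
tei_like_document = {"text": {"body": {"div": [
    {"p": ["First paragraph.", "Second paragraph."]},
]}}}

# (3) Multimodal annotation: nested AND networked data, i.e. a graph
#     whose nodes may link across modes (gesture <-> speech).
multimodal_graph = {
    "nodes": {"g1": "pointing gesture", "u1": "utterance 'there'"},
    "edges": [("g1", "u1", "co-occurs-with")],
}
```

No single storage paradigm serves all three equally well: (1) favors index or key-value stores, (2) document databases, and (3) graph databases, which is precisely the diversity problem addressed in this paper.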
This is done by examining elementary database operations that underlie typical scenarios in NLP and natural language annotation. We show that none of the DBMS under consideration consistently works best. Rather, families of such DBMS, manifesting different database paradigms, cope better with the rising need of processing big as well as divergent data. To demonstrate this, the paper introduces a web-based multi-database management system (MDBMS) that allows for integrating families of databases so as to provide a single interface for querying them. To this end, the MDBMS encapsulates the specifics of the query languages of all databases involved. By using it, modelers are no longer forced to choose a single database, but can benefit from different databases that capture a variety of data modeling paradigms. This enables projects to simultaneously manage and query a variety of data by means of a multitude of databases, in a way that also allows for processing big data. Such a versatile MDBMS is especially interesting for NLP frameworks (Castilho and Gurevych, 2014; Hemati et al., 2016; Hinrichs et al., 2010; OpenNLP, 2010; Popel and Žabokrtský, 2010) that still do not store annotations of documents in a queryable format. In cases where such annotations are stored according to the UIMA Common Analysis Structure (UIMA-CAS), serialization is based on XML files. This makes operations such as collecting statistical information from documents very time-consuming, since each file involved first has to be deserialized. The same problem occurs when trying to add annotation layers. To additionally annotate, for example, named entities, one has two options: either one runs the underlying UIMA pipeline together with the named-entity recognition (NER) component from scratch, or the serialized output of the pipeline is deserialized and then made input to NER. Obviously, both variants are very time-consuming.
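The deserialization bottleneck can be illustrated with a minimal sketch: even to compute a simple statistic such as a token count, each XMI-serialized CAS file must first be parsed in full. The XMI snippet and the type namespace below are simplified illustrations of the format, not an exact reproduction of any particular type system:

```python
import xml.etree.ElementTree as ET

# A toy stand-in for an XMI serialization of a UIMA CAS; the namespace
# and the Token type are illustrative placeholders.
SAMPLE_XMI = b"""<?xml version="1.0" encoding="UTF-8"?>
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
         xmlns:type="http:///example/segmentation/type.ecore">
  <type:Token xmi:id="1" begin="0" end="5"/>
  <type:Token xmi:id="2" begin="6" end="11"/>
</xmi:XMI>"""

def count_tokens(xmi_bytes):
    # The whole document tree must be parsed before a single annotation
    # can be inspected -- this per-file cost is the problem criticized
    # above, and it repeats for every file and every query.
    root = ET.fromstring(xmi_bytes)
    ns = "{http:///example/segmentation/type.ecore}"
    return sum(1 for _ in root.iter(ns + "Token"))

print(count_tokens(SAMPLE_XMI))  # prints 2
```

With annotations held in a queryable database instead, such a count would be a single index lookup or aggregation query, with no per-file parsing.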
A better option would be to let NER operate only on those annotations that are needed for annotating named entities. However, this presupposes that all annotations are