Using GrAF for Advanced Convertibility of IGT data

Dorothee Beermann, Peter Bouda
Norwegian University of Science and Technology, Centro Interdisciplinar de Documentação Linguística e Social
dorothee.beermann@ntnu.no, pbouda@cidles.eu

Abstract

With the growing availability of multi-lingual, multi-media and multi-layered corpora, including corpora for lesser-documented languages, and with a growing number of tools that allow their exploitation, working with corpora has attracted the interest of researchers from theoretical as well as applied fields. But whenever information from different sources is combined, the lack of interoperability becomes a problem. This is a particular challenge for corpora from endangered and lesser-described languages, since they often originate from work by individual researchers and small projects using different methodologies and different tools. Before this material can become a true resource, well-known differences in the physical as well as the conceptual data structure must be weighed against future data use and exploitation. Working with Interlinear Glossed Text (IGT), a common annotation format for linguistic data from lesser-described languages, we use GrAF to achieve Advanced Convertibility. Our goal is to build data bridges between a number of linguistic tools. As a result, data will become mobile across applications.

Keywords: Language Documentation, Interlinear Glossed Text, Natural Language Processing

1. Introduction

Convertibility depends on the data's physical and conceptual structure. In this paper we focus on the latter. Using the term Advanced Glossing, Drude (2002) suggests as a de dicto standard a fixed set of annotation tiers across several annotation tables, to allow a conceptually cleaner yet comprehensive linguistic annotation tailored to the needs of documenting linguists.
Using the "Graph Annotation Framework" (GrAF; Ide and Suderman, 2007), we would like to promote the flexible integration of de facto standards instead. The idea is to present a relatively simple model for the presentation of analytic layers. Yet our approach faces multiple challenges that arise from the particular nature of Interlinear Glossed Texts (IGTs). They differ not only in the concepts they encode, but also in the way these concepts are expressed across tiers. Using GrAF via the software library Poio API [1], we would like to show that Advanced Convertibility allows us to stay within a given "semantics" (nodes, annotations, edges) when converting from one format into another. Once learned, these "graph semantics" can be applied to any transformation. This makes our approach sustainable.

IGT normally consists of 3-5 lines, which are also called "tiers" [2]. Originally, researchers used IGT in descriptive linguistics and related disciplines to discuss features of languages in articles and books. An IGT was, at least in the more empirically oriented fields of linguistics, regarded as evidence in the evaluation of a hypothesis. Example (1) shows a typical IGT:

(1) Example from Kaguru (ISO 639-3 kki)

    Kamei   howoluta       kunyumbangwa               imwe,
    kamei   ha-wa-lut-a    ku-nyumba-ngwa             di-mwe
    then    PAST-2-go-FV   17-house:9/10-somebody's   5-one
    adv     tm-sm-v-fv     sm-n-prn                   ncp-num
    Then they went to one house

[1] http://media.cidles.eu/poio/poio-api/, accessed 19.2.2014
[2] For a discussion of this data format, its variants and its problems see for example (Bow Catherine and Bird, 2003), (Palmer and Erk, 2007) and (Beermann and Mihaylov, 2013).

In this case the example consists of tiers for "words", "morphemes", "(morpho-syntactic) glosses", "part-of-speech" and "free translation", which are partly aligned vertically. There have been several attempts to standardize IGT, the "Leipzig Glossing Rules" [3] being the most prominent example.
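As a minimal sketch of the "graph semantics" named above (nodes, annotations, edges), the tiers of example (1) could be held in a small graph like the one below. This is an illustration only: it uses neither Poio API nor the GrAF XML serialization, and the names Node, Graph and add_word, as well as the node identifiers, are hypothetical and chosen for this sketch. It also assumes that morphemes, glosses and part-of-speech tags are separated by hyphens and align one to one, which holds for example (1) but not for IGT in general.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node; its annotations are plain key/value pairs."""
    node_id: str
    annotations: dict = field(default_factory=dict)


@dataclass
class Graph:
    """A toy annotation graph: nodes plus directed parent -> child edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, **annotations):
        self.nodes[node_id] = Node(node_id, annotations)

    def add_edge(self, parent_id, child_id):
        self.edges.append((parent_id, child_id))

    def children(self, parent_id):
        return [self.nodes[c] for p, c in self.edges if p == parent_id]


def add_word(graph, utterance_id, index, word, morphemes, glosses, pos):
    """Add one word node and one child node per hyphen-separated morpheme."""
    word_id = f"{utterance_id}.w{index}"
    graph.add_node(word_id, tier="word", text=word)
    graph.add_edge(utterance_id, word_id)
    # Assumption: the three tiers split into the same number of parts.
    parts = zip(morphemes.split("-"), glosses.split("-"), pos.split("-"))
    for j, (morph, gloss, tag) in enumerate(parts):
        morph_id = f"{word_id}.m{j}"
        graph.add_node(morph_id, tier="morpheme", text=morph, gloss=gloss, pos=tag)
        graph.add_edge(word_id, morph_id)


# Tiers of example (1), one tuple per word column (punctuation omitted).
columns = [
    ("Kamei", "kamei", "then", "adv"),
    ("howoluta", "ha-wa-lut-a", "PAST-2-go-FV", "tm-sm-v-fv"),
    ("kunyumbangwa", "ku-nyumba-ngwa", "17-house:9/10-somebody's", "sm-n-prn"),
    ("imwe", "di-mwe", "5-one", "ncp-num"),
]

graph = Graph()
graph.add_node("u0", tier="utterance", translation="Then they went to one house")
for i, column in enumerate(columns):
    add_word(graph, "u0", i, *column)

# Walk from the second word down to its morphemes and their glosses.
for morph in graph.children("u0.w1"):
    print(morph.annotations["text"], morph.annotations["gloss"])
```

Because every tier ends up as annotations on nodes linked by edges, a differently named or differently ordered tier in another file format would change only the annotation keys, not the shape of the graph.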
None of these standardization attempts, however, was accepted by the community, and today a diversity of tier names, tier structures and annotation schemes co-exist in published data. This is one of the most important reasons why linguists struggle to analyse and compare data from different projects. In this paper we demonstrate how GrAF as a pivot data model can help researchers exchange and analyse data in different file formats and with different tools.

2. Annotation graphs as pivot model

Our approach depends on the use of annotation graphs, i.e. the recently standardized implementation of the Linguistic Annotation Framework (LAF) as described in ISO 24612 [4].

[3] http://www.eva.mpg.de/lingua/resources/glossing-rules.php, accessed 26.3.2014
[4] "Language resource management - Linguistic annotation framework", http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326, accessed 3.2.2014

LAF was developed as an underlying data model for linguistic annotations designed to allow a better insight into