Resources for Translation Representation of User Activity Data and Treebanks

Michael Carl
Copenhagen Business School
Copenhagen, Denmark

Henrik Høeg Müller
Copenhagen Business School
Copenhagen, Denmark

Abstract

The CRITT¹ environment at Copenhagen Business School (CBS) draws primarily on two types of NLP resources, treebanks and logged user activity data (UAD), to study the cognitive processes that underlie translation activity. In this paper we briefly present the Copenhagen Dependency Treebank (CDT)², and we give a more detailed account of how UAD is represented in Translog-II. Finally, the paper discusses some general perspectives on how process-oriented translation research methodology could benefit from integrating UAD with structural linguistic information in the form of linguistically annotated text data.

Keywords: translation, eyetracking, keystroke logging, parallel multilingual treebanks

1. Introduction

The main focus of the CRITT research environment is the empirical and experimental study of translation processes with an applied, technological aim. Our research designs involve data elicitation methods (keystroke logging) and behavioural measuring technologies (eyetracking), as well as parallel linguistically annotated text collections (treebanks).

To record user activity data (UAD), CRITT has developed the computer programme Translog-II, which logs keystrokes, mouse activities and gaze movements during text production. With respect to treebanks, the CRITT translation research programme has devised the Copenhagen Dependency Treebank (CDT), an NLP resource which provides information about language structure and meaning on various levels. While the CDT annotates the static structure of the parallel, translated texts, Translog-II provides information on how the parallel data was actually created during the translation process.
It has occasionally been observed that bilingual data resources are taken for granted, for instance by the MT community. Although some developers within the MT community do not seem to acknowledge the amount of work and the processes by which this data is generated (Way, 2009), there is hardly any doubt that the quality and origin of the bilingual data has an impact on the translation quality and usability of MT systems. Since around the mid-1990s, most texts have been produced by humans using a (computer) keyboard, yet hardly any empirical data is available that is suited to investigating how translations are generated, or to uncovering and describing the processes by which humans produce translations. A central aim of CRITT is to close this gap.

In this paper, we seek to connect the two worlds of product and process annotation. The paper is structured as follows. Section 2 illustrates, very briefly, how linguistic structure is annotated in the CDT, including alignment. Section 3 focuses on how UAD is structured in Translog-II, and Section 4 offers some speculations about the possible benefits of integrating the CDT and Translog-II. Finally, Section 5 sums up the central points.

2. CDT

The Copenhagen Dependency Treebank (CDT) (Kromann, 2003) is a multilingual open NLP resource which consists of linguistically annotated parallel text collections of approx. 60,000 words each for Danish, English, German, Italian and Spanish. The CDT is based on a unified dependency annotation which includes not only syntax, but also fine-grained analyses of morphological, discourse and anaphoric structure. Moreover, in order to extend its potential applicability to MT, the resource has an alignment system of translational equivalences that allows us to specify relations between

1 Center for Research and Innovation in Translation and Translation Technology (see http://www.cbs.dk/en/Research/Departments-Centres/Institutter/CRITT ).
2 The project is hosted on Google Code – http://code.google.com/p/copenhagen-dependency-treebank/ – and all the sources are freely available.