Specifying a Dependency Representation with a Grammar Definition Corpus

Atro Voutilainen and Krister Linden
Department of Modern Languages, University of Helsinki
first.last@helsinki.fi

Abstract

We outline the design and creation of syntactically and morphologically annotated corpora of Finnish for use by the research community. We motivate a definitional, systematic “grammar definition corpus” as a basic step in a three-year annotation effort to help create extensive, systematically documented parsebanks. The syntactic representation, consisting of a dependency structure and a basic set of dependency functions, is outlined with examples.

1. Background

This article focuses on designing a grammar definition corpus for Finnish, but first we need to say something about the purpose and context of the effort.

1.1 Treebank, Parsebank, Grammar Definition Corpus

A Treebank can be described as a set of sentences syntactically annotated by trained linguists. A hand-annotated Treebank is restricted in size, is of high annotation quality and consistency, and represents running text sentences and/or selected sentences illustrating various syntactic structures of the language. The PARC 700 Dependency Bank is a good example of a manually annotated Treebank, with a set of 700 text sentences annotated manually according to a form of Lexical Functional Grammar (King et al., 2003).

A Parsebank can be characterized as a large number of sentences that have been annotated mechanically (with a parser), where the annotating parser has been modified repeatedly: its output is sampled, mistakes are corrected, and a better Parsebank is gradually created.

In order to create a high-quality Parsebank, we need documentation and examples on the linguistic representation and its use in text analysis.
A hand-annotated set of sentences is useful, but in order to approximate the structures that are used in a large corpus of text in a more comprehensive and systematic way, we need a more exhaustive and systematic set of sentences to be analysed