Creating a Corpus of Auslan within an Australian National Corpus

Trevor Johnston
Macquarie University

1. Introduction

The creation of signed language (SL) corpora presents special challenges to linguists. Signed languages are face-to-face visual-gestural languages that have no widely accepted written forms or standard specialist notation system, making even superficial transcription problematic. SL corpora need to be created taking these facts into account. Using the example of Auslan (Australian Sign Language), this paper describes how multimedia annotation software can now be used to transform a language recording into a machine-readable text without it first being transcribed, provided that conventional linguistic units are systematically and consistently identified, thus making possible the creation of a true linguistic corpus of an SL. Before examining SL annotation in detail, we first review the main features of modern linguistic corpora and introduce the Auslan archive which is the source of the future Auslan corpus. The paper concludes with an assessment of the place of an Auslan corpus within an Australian National Corpus and an evaluation of other recent SL corpus projects elsewhere in the world.

A modern linguistic corpus is something more than just a dataset of written or transcribed texts upon which a description or an analysis of a language is based. That older sense of corpus has now essentially been superseded in the literature (e.g., McEnery & Wilson, 2001; Sampson & McCarthy, 2004; Hoey, Mahlberg, Stubbs, & Teubert, 2007). A corpus in the modern sense is a collection of written and spoken texts in a machine-readable form that has been assembled for the purposes of studying the type and frequency of constructions in a language. A modern linguistic corpus contains linguistic annotations and appended sociolinguistic and sessional data (metadata) that describe the participants and the circumstances under which the data were collected.
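As a rough illustration of what "machine-readable" means here, the following Python sketch models a corpus as a set of recorded sessions, each carrying time-aligned annotations and appended metadata, and then counts how often each annotated unit occurs. It is only a minimal sketch under stated assumptions: the class names (Annotation, Session), the metadata keys, the file names, and the glosses are all hypothetical and invented for the example, and are not taken from the Auslan archive or from any particular annotation tool.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One time-aligned annotation on a recording (times in milliseconds)."""
    start_ms: int
    end_ms: int
    gloss: str  # hypothetical identifying label for an annotated unit


@dataclass
class Session:
    """One recorded text plus metadata describing how it was collected."""
    media_file: str
    metadata: dict
    annotations: list = field(default_factory=list)


def gloss_frequencies(sessions):
    """Count how often each gloss occurs across all sessions."""
    counts = Counter()
    for session in sessions:
        counts.update(a.gloss for a in session.annotations)
    return counts


# Invented example data (assumption): not drawn from any real corpus.
corpus = [
    Session(
        media_file="session_001.mpg",
        metadata={"participant": "P01", "task": "narrative"},
        annotations=[
            Annotation(0, 400, "BOY"),
            Annotation(400, 900, "LOOK"),
            Annotation(900, 1300, "BOY"),
        ],
    ),
    Session(
        media_file="session_002.mpg",
        metadata={"participant": "P02", "task": "narrative"},
        annotations=[
            Annotation(0, 500, "DOG"),
            Annotation(500, 950, "BOY"),
        ],
    ),
]

freqs = gloss_frequencies(corpus)
for gloss, count in freqs.most_common():
    print(gloss, count)
```

Because every annotation is tied to a time span in the recording and every session carries its own metadata, counts of this kind could in principle be restricted to particular participants or tasks, which is the sort of query a dataset of untagged recordings cannot support.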
With the development of digitized video recording and multimedia annotation software, a corpus of a signed language (henceforth, SL) can now be described as a subtype of ‘spoken’ language corpora, namely face-to-face language. SL corpora promise to vastly improve peer review of descriptions of SLs and make possible, for the first time, a corpus-based approach to SL analysis. Corpora are important for the testing of language hypotheses in all language research at all levels, from phonology, through lexis, morphology, syntax, and pragmatics to discourse.

There are several reasons why testing is particularly relevant in the field of SL studies. First, SLs, which are invariably young languages of minority communities, lack written forms and the well-developed community-based standards of correctness that often accompany literacy. Second, they have interrupted generational transmission and few native speakers. Third, the representation of SL examples using written glosses has meant that primary data have remained essentially inaccessible to other researchers and consequently unavailable for meaningful peer review. Thus, although introspection and observation can still be of valuable assistance to linguists developing hypotheses regarding SL use and structure, one must also recognize that intuitions and researcher observations may fail in the absence of clear native signer consensus of phonological or grammatical typicality, markedness, or acceptability. The previous reliance on the intuitions of small numbers of informants has thus been problematic in the field.

As with all modern linguistic corpora, SL corpora should be representative, well-documented, and machine-readable (McEnery & Wilson, 2001; Teubert & Cermáková, 2007). This not only requires dedicated technology and standards (e.g., Crasborn et al., 2007), it also requires a principled methodology for transcription or annotation.
© 2009 Trevor Johnston. Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus, ed. Michael Haugh et al., 87-95. Somerville, MA: Cascadilla Proceedings Project.

The guiding principle behind the linguistic annotations being created in the initial stages of an Auslan corpus is machine-readability, not transcription narrowly understood. The aim is to create an