Creating a Corpus of Auslan within an Australian
National Corpus
Trevor Johnston
Macquarie University
1. Introduction
The creation of signed language (SL) corpora presents special challenges to linguists. SLs are
face-to-face visual-gestural languages that have no widely accepted written forms or standard specialist
notation system, making even superficial transcription problematic. SL corpora need to be created
taking these facts into account. Using the example of Auslan (Australian Sign Language), this paper
describes how multimedia annotation software can now be used to transform a language recording into
a machine-readable text without it first being transcribed, provided that conventional linguistic units are
systematically and consistently identified, thus making possible the creation of a true linguistic corpus
of a SL. Before examining SL annotation in detail, we first review the main features of modern
linguistic corpora and introduce the Auslan archive, which is the source of the future Auslan corpus.
The paper concludes with an assessment of the place of an Auslan corpus within an Australian National
Corpus and an evaluation of other recent SL corpus projects elsewhere in the world.
A modern linguistic corpus is something more than just a dataset of written or transcribed texts
upon which a description or an analysis of a language is based. This sense of corpus has now essentially
been superseded in the literature (e.g., McEnery & Wilson, 2001; Sampson & McCarthy, 2004; Hoey,
Mahlberg, Stubbs, & Teubert, 2007). A corpus in the modern sense means a collection of written and
spoken texts in a machine-readable form that has been assembled for the purposes of studying the type
and frequency of constructions in a language. A modern linguistic corpus contains linguistic
annotations and appended sociolinguistic and sessional data (metadata) that describe the participants
and the circumstances under which the data were collected. With the development of digitized video
recording and multimedia annotation software, a corpus of a SL can now
be described as a subtype of ‘spoken’ language corpora, namely face-to-face language. SL corpora
promise to vastly improve peer review of descriptions of SLs and make possible, for the first time, a
corpus-based approach to SL analysis.
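The three components just described (a recording, time-aligned annotations, and appended metadata) can be illustrated with a minimal sketch. It is not the format used for the Auslan corpus or by any particular annotation tool; the class names, field names, file names, glosses, and metadata values below are all hypothetical and chosen only for illustration. The sketch assumes that each annotation attaches a consistent identifying label to a stretch of video, which is what allows frequencies to be counted by machine.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One time-aligned annotation on a video recording (times in milliseconds)."""
    start_ms: int
    end_ms: int
    gloss: str  # hypothetical identifying label for the annotated unit


@dataclass
class Session:
    """One recording together with its metadata and annotations."""
    video_file: str
    metadata: dict  # sociolinguistic and sessional data (illustrative keys only)
    annotations: list = field(default_factory=list)


def gloss_frequencies(sessions):
    """Count how often each gloss occurs across all sessions."""
    counts = Counter()
    for session in sessions:
        counts.update(a.gloss for a in session.annotations)
    return counts


# Invented example data, not drawn from any real recording.
sessions = [
    Session(
        "session_a.mp4",
        {"region": "Sydney", "native_signer": True},
        [Annotation(0, 400, "HOUSE"), Annotation(400, 900, "BIG")],
    ),
    Session(
        "session_b.mp4",
        {"region": "Melbourne", "native_signer": False},
        [Annotation(0, 350, "HOUSE")],
    ),
]

print(gloss_frequencies(sessions).most_common())
# [('HOUSE', 2), ('BIG', 1)]
```

Because the labels are attached directly to time spans of the recording, the count can be made without a prior written transcription, provided that the same unit always receives the same label.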
Corpora are important for the testing of language hypotheses in all language research at all levels,
from phonology, through lexis, morphology, syntax, and pragmatics to discourse. There are several
reasons why testing is particularly relevant in the field of SL studies. First, SLs, which are invariably
young languages of minority communities, lack written forms and the well developed community-
based standards of correctness that often accompany literacy. Second, they have interrupted
generational transmission and few native speakers. Third, the representation of SL examples using
written glosses has meant that primary data have remained essentially inaccessible to other researchers
and consequently unavailable for meaningful peer review. Thus, although introspection and observation
can still be of value to linguists developing hypotheses regarding SL use and structure,
one must also recognize that intuitions and researcher observations may fail in the absence of clear
native signer consensus on phonological or grammatical typicality, markedness, or acceptability. The
previous reliance on the intuitions of small numbers of informants has thus been problematic in the
field. As with all modern linguistic corpora, SL corpora should be representative, well-documented, and
machine-readable (McEnery & Wilson, 2001; Teubert & Cermáková, 2007). This requires not only
dedicated technology and standards (e.g., Crasborn et al., 2007) but also a principled
methodology for transcription or annotation.
The guiding principle behind the linguistic annotations being created in the initial stages of an
Auslan corpus is machine-readability, not transcription narrowly understood. The aim is to create an
© 2009 Trevor Johnston. Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian
National Corpus, ed. Michael Haugh et al., 87-95. Somerville, MA: Cascadilla Proceedings Project.