International Journal of Corpus Linguistics 21:3 (2016), 348–371. doi 10.1075/ijcl.21.3.03die
issn 1384–6655 / e-issn 1569–9811 © John Benjamins Publishing Company
Compiling computer-mediated spoken language corpora
Key issues and recommendations
Stefan Diemer, Marie-Louise Brunner and Selina Schmidt
Saarland University
This paper discusses key issues in the compilation of spoken language corpora
in a computer-mediated communication (CMC) environment, using data from
the Corpus of Academic Spoken English (CASE), a corpus of Skype conversations
currently being compiled at Saarland University, Germany, in cooperation
with European and US partners. Based on first findings, Skype is presented as a
suitable tool for collecting informal spoken data. In addition, new recommendations
concerning data compilation and transcription are put forward to supplement
existing best practice as presented in Wynne (2005). We recommend the
preservation of multimodal features during anonymisation, and the addition of
annotation elements already at the transcription stage, particularly CMC-related
discourse features, English as a Lingua Franca (ELF) features (e.g. non-standard
language and code-switching), as well as the inclusion of prosodic, paralinguistic,
and non-verbal annotation. Additionally, we propose a layered corpus design
in order to allow researchers to focus on specific annotation features.
Keywords: spoken language corpora, data compilation and transcription,
computer-mediated communication (CMC), best practice, Skype
1. Introduction
In this paper, we look at key issues related to the compilation of spoken language
corpora in a computer-mediated communication (CMC) environment. It has
been more than ten years since Thompson (2005) addressed the issue of compiling
spoken corpora to establish best practice recommendations, and in the years
since the guidelines were published, there have been considerable changes both
in technology and in the quality of the linguistic data collected. In particular,
while much research has been focusing on written CMC, spoken conversations