Edinburgh Occasional Papers in Linguistics
August 11 1995

The segmentation and labelling of speech databases

Briony Williams
briony@cstr.ed.ac.uk

1 Labelling at the segmental level

1.1 Introduction

Segmentation is the division of a speech file into non-overlapping sections corresponding to physical or linguistic units. Labelling is the assignment of physical or linguistic labels to these units. Both segmentation and labelling form a major part of current work in linguistic databases.

1.1.1 Segmental transcription

The term ‘transcription’ may be used to refer to the representation of a text or an utterance as a string of symbols, without any linkage to the acoustic representation of the utterance. This was the pattern followed by speech and text corpus work during the 1980’s, such as the prosodically-transcribed Spoken English Corpus (Knowles et al. 1995). These corpora did not link the symbolic representation with the physical acoustic waveform, and hence were not fully machine-readable. A recent project, MARSEC (Roach et al. 1993), has generated these links for the Spoken English Corpus such that it is now a segmented and labelled database. This is the form that is most useful to researchers in speech and language technology.

The types of segments that may be delimited are of various kinds, depending on the purpose for which the database is collected. The German PHONDAT and Verbmobil-PHONDAT corpora use the CRIL (Computer Representation of Individual Languages) conventions formulated by a working group at the 1991 Kiel convention of the International Phonetic Association. These conventions propose three levels of representation: orthographic, phonetic and narrow phonetic. The orthographic level contains the orthographic representation of the spoken text. The phonetic level specifies the phonetic form of a word in citation form. The narrow phonetic level gives the phonetic labelling of the particular token of the word that was recorded.
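
The relation between a segmentation and the three levels of representation may be illustrated with a minimal sketch in Python. It is not taken from CRIL or from any of the corpora named above: the names `Segment`, `LabelledToken` and `is_valid_segmentation`, the time values, and the SAMPA-style symbols are all assumptions made for illustration. The sketch stores one recorded token of a word at the three levels, and checks the property given in the definition of segmentation, namely that the time-aligned sections do not overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One labelled section of a speech file (hypothetical structure)."""
    start: float  # seconds from the start of the speech file
    end: float    # seconds from the start of the speech file
    label: str


def is_valid_segmentation(segments: List[Segment]) -> bool:
    """Return True if the segments are in time order, have positive
    duration, and do not overlap one another."""
    previous_end = 0.0
    for seg in segments:
        if seg.start < previous_end or seg.end <= seg.start:
            return False
        previous_end = seg.end
    return True


@dataclass
class LabelledToken:
    """One recorded word at three levels of representation
    (field names are assumptions, not CRIL terminology)."""
    orthographic: str               # spelling of the spoken text
    phonetic: str                   # citation form of the word
    narrow_phonetic: List[Segment]  # what was actually said, time-aligned


# Illustrative values only: a reduced pronunciation of "and",
# written with SAMPA-style symbols and invented time points.
token = LabelledToken(
    orthographic="and",
    phonetic="{nd",
    narrow_phonetic=[
        Segment(0.50, 0.56, "@"),
        Segment(0.56, 0.63, "n"),
    ],
)

print(is_valid_segmentation(token.narrow_phonetic))
```

In this sketch only the narrow phonetic level carries time points, since it is the level that describes the recorded token itself. A database could equally align the orthographic level with the waveform; that is a design choice which the sketch does not settle.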

A more detailed system of levels of labelling, which includes the above three levels, has been proposed by Barry & Fourcin (1992). These levels grew out of the SAM project for the major European languages, and are described in detail below. Each speech corpus will adopt one or more of them.

The format of label (transcription) files varies widely across research institutions. The WAVES format is becoming popular, and has the advantage of being human-readable. It is advisable to use a label file format that can easily be converted to a WAVES label file, for the sake of portability across different systems.

During the International Conference on Spoken Language Processing (ICSLP) in Banff in 1992, a workshop was held on Orthographic and Phonetic Transcription. The workshop goals were to agree on areas where community-wide conventions are needed, to identify