Code, comments and consistency: a case study of the problems of reuse of encoded texts

Claire WARWICK, George BUCHANAN, Jeremy GOW, Ann BLANDFORD, Jon RIMMER

Introduction

It has long been an article of faith in computing that when a resource, a program or code is created, it ought to be documented (Raskin, 2005). It is also an article of faith in humanities computing that markup should be non-platform-specific (e.g. SGML or XML). One important reason for both practices is to make the reuse of resources easier, especially when the user may have no knowledge of, or access to, the original resource creator (Morrison et al., n.d., chapter 4). However, our paper describes the problems that may emerge when such good practice is not followed. Through a case study of our experience on the UCIS project, we demonstrate why documentation, commented code and the accurate use of SGML and XML markup are vital if there is to be realistic hope of reusing digital resources.

Background to the Project

The UCIS project (www.uclic.ucl.ac.uk/annb/DLUsability/UCIS) is studying the way that humanities researchers interact with digital library environments. We aim to find out how the contents and interface of such collections affect the way that humanities scholars use them, and what factors inhibit their use (Warwick et al., 2005). An early work package of the project was to build a digital text collection for humanities users, delivered via the Greenstone digital library system. We chose to use texts from the Oxford Text Archive (OTA), because this substantial collection is freely available and contains at least basic levels of XML markup. However, this task was to prove unexpectedly difficult, for reasons that extend beyond the particular concerns of UCIS.

Findings

On examination of a sample of the files, we found that although they appeared to be in well-formed XML, there were many inconsistencies in the markup.
These inconsistencies often arise from the electronic history of the documents. The markup of older (Early and Middle) English texts is complex, and many of the problems stem from successive revisions to the underlying content. One common early standard was Cocoa markup, and many of the documents still contain Cocoa tags, which means that the files will not parse as XML. In Cocoa, the (human) encoder can provide tags that indicate parts of the original document, their form and clarity. These tags were retained in their original Cocoa form, which the processing software mistook for potential TEI tags. Many characters found in earlier English were encoded using idiosyncratic forms where modern (Unicode or SGML entity) alternatives now exist; the earlier Cocoa forms may render the modern electronic encoding unparsable as either XML or SGML.

Another problem with Cocoa markup is that it was never fully standardised, and tags are often created or used idiosyncratically (Lancashire, 1996). This complicates a number of potential technical solutions (e.g. the use of XML namespaces). Some content included unique tags such as “<Cynniges>”, which are not part of any acknowledged variant of the original standard. The nature of such a tag is unclear: it may be an original part of the text (words actually surrounded by ‘<’ and ‘>’), a Cocoa tag, or a TEI/SGML/XML tag. Distinguishing forms known to a modern TEI/XML document is straightforward; distinguishing between Cocoa and SGML/XML is not possible in this context.
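The parsing failure described above is easy to reproduce. The sketch below is illustrative only: the fragment and tag names are hypothetical, not drawn from the OTA files. A Cocoa-style reference tag such as <T Beowulf> is not well-formed XML, because an XML parser reads “Beowulf” as an attribute name with no value, while a legitimately ambiguous string such as <Cynniges> parses as an ordinary start tag, so the parser alone cannot tell Cocoa residue from intended markup.

```python
# Hypothetical fragments illustrating the two cases discussed above.
import xml.etree.ElementTree as ET

# Case 1: a Cocoa-style tag breaks XML well-formedness outright.
cocoa_fragment = "<text><T Beowulf>Hwaet we Gardena</T></text>"
try:
    ET.fromstring(cocoa_fragment)
except ET.ParseError as exc:
    print("Cocoa tag rejected:", exc)

# Case 2: an unknown tag like <Cynniges> is well-formed XML,
# so the parser accepts it -- it cannot reveal whether the tag
# was meant as Cocoa, TEI/SGML/XML, or literal text.
ambiguous_fragment = "<text><Cynniges>...</Cynniges></text>"
root = ET.fromstring(ambiguous_fragment)
print("Ambiguous tag parsed as element:", root[0].tag)
```

In the first case the parse fails at the Cocoa tag; in the second, the document parses cleanly and the ambiguity survives into the element tree, which is precisely why such tags cannot be resolved mechanically.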