Using Human Language Technology for Automatic Annotation and Indexing of Digital Library Content

Kalina Bontcheva, Diana Maynard, Hamish Cunningham, and Horacio Saggion

Dept of Computer Science, University of Sheffield, 211 Portobello St, Sheffield, UK S1 4DP
{kalina,diana,hamish,saggion}@dcs.shef.ac.uk

Abstract. In this paper we show how we used robust human language technology, such as our domain-independent and customisable named entity recogniser, for automatic content annotation and indexing in two digital library applications. Each of these applications posed a unique challenge: one required adapting the language processing components to the non-standard written conventions of 18th century English, while the other presented the challenge of processing material in multiple modalities. This reusable technology could also form the basis for the creation of computational tools for the study of cultural heritage languages, such as Ancient Greek and Latin.

1 Introduction

As digital libraries grow in size and coverage, so does the need for automatic content annotation and indexing. Recent advances in human language technologies (HLT) such as named entity recognition, information extraction, and summarisation have made it possible to automatically create metadata (e.g., extract authors and titles) and document summaries, as well as to annotate and index documents with information about persons, locations, dates, etc. These advances have been seen both in the quality of the results and in the robustness of the software solutions available. An increased acceptance of the importance of engineering to the successful application of HLT has led to more predictable systems that can realistically serve as technology providers for Digital Library systems (which have high reusability and portability requirements).
In the digital library context, especially cultural digital libraries (e.g., [8]), language technology can offer new ways of accessing the collections (e.g., through indexes of events), as well as lowering the costs of annotating documents with metadata and other relevant information. While fully automatic solutions might not always be possible or practical, HLT can frequently be used to bootstrap these laborious tasks. In this paper we show how we used such technologies for automatic content annotation and indexing in two digital library applications: eighteenth century court trials (OldBaileyIE) and a multilingual and multimodal collection on the Euro2000 football tournament (MUMIS). Each of these applications posed a