Towards named entity annotation of Latvian National Library corpus

Peteris PAIKENS¹, Ilze AUZINA, Ginta GARKAJE and Madara PAEGLE
University of Latvia, Institute of Mathematics and Computer Science

Abstract. The paper describes work in progress on building a catalogue of named entities – people, places and organizations – based on a recently digitized large (4.5 billion tokens) Latvian corpus. The authors propose an annotation standard for marking up named entities in Latvian corpora, according to which a representative set of documents (150 000 words) is manually annotated. This annotated set is used for training and evaluating an automated named entity recognition system based on the Stanford CRF classifier, achieving an F-score of up to 81%. The named entities indexed within the Latvian National Library corpus and the annotated documents are publicly available online for linguistic and historical research.

Keywords. Named entity recognition, NER, Latvian, corpus indexing

Introduction

The recent digitization of the National Library of Latvia archives [1] has provided a valuable potential resource for researchers. To enable effective analysis and research, we aim to create a comprehensive catalogue of the named entities mentioned in this corpus. The corpus contains 240 000 books and newspapers (approx. 4.5 billion tokens) dating from the 18th century to 2008, with a particular focus on Latvian publications of the 1920s and 1930s. It also includes a number of locally printed historical works in German, Russian and other languages, but the majority (70%) of the data is in Latvian, making it the largest currently available Latvian corpus.

The digitized data (scanned images and OCR results) is publicly available² and searchable. However, we consider common full-text indexing systems insufficient for analyzing this corpus to its full extent.
Linguistic analysis needs to take into account the morphological complexity of the Latvian language, and in historical research the spelling of proper names in documents often differs from what a searcher expects, owing to historical reforms in Latvian orthography and morphology and to the large number of OCR mistakes present in the digitized documents. Recognizing and properly indexing the named entities would provide a unique and valuable publicly available resource for Latvian historical, sociological and linguistic research.

This paper describes current efforts and results in augmenting the raw text corpus with automatically tagged morphological and named entity information, and in offering the analysis results as publicly available online services for further research.

¹ Corresponding Author.
² www.periodika.lv