WandaML: a markup language for digital document annotation

Katrin Franke (2), Isabelle Guyon (1), Lambert Schomaker (3), and Louis Vuurpijl (4)

1. ClopiNet, 955 Creston Rd, Berkeley, USA, isabelle@clopinet.com (corresponding author).
2. Fraunhofer Institute, Berlin, Germany.
3. Rijksuniversiteit Groningen, The Netherlands.
4. University of Nijmegen, The Netherlands.

Abstract

WandaML is an XML-based markup language for the annotation and filter journaling of digital documents. It addresses in particular the needs of forensic handwriting data examination, by allowing experts to enter information about writer, material (pen, paper), script and content, and to record chains of image filtering and feature extraction operations applied to the data. We present the design of this format and some annotation examples, in the more general perspective of digital document annotation. Annotations may be organized in a structure that reflects the document layout via a hierarchy of document regions. WandaML lends itself to a variety of applications, including the annotation of all kinds of handwritten documents (on-line or off-line), images of printed text, medical images, and satellite images.

Keywords: Handwriting, forensic data, XML, annotations, data format, document analysis.

1 Introduction

We present the design of an XML-based markup language for annotating digital documents, called WandaML. This markup language is designed for processing, analyzing and storing handwriting samples for forensic handwriting examination and writer identification. In the context of this application, particular specifications were met to ensure objectivity and reproducibility of the processing steps and examination results [6, 5]. Writer identification can never be as accurate as iris or DNA-based identification.
However, several constraining pieces of information are usually known (age category, handedness, major script style), and these may reduce the size of the reference set to such an extent that automatic writer identification on the basis of script shape within the reduced set becomes viable. To that end, a portable and extensible data format is needed for modelling the knowledge from the forensic application domain. Standard database technology can then be used to apply logical constraints to the search process.
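The idea of narrowing the reference set with logical constraints before any shape-based comparison can be illustrated with a minimal sketch. It is not the WANDA implementation: the table name `writer`, its columns, the sample records, and the helper functions are all hypothetical, and an in-memory SQLite database merely stands in for "a standard database technology".

```python
import sqlite3

# Hypothetical attribute names; the actual WandaML schema is not assumed here.
ALLOWED_COLUMNS = ("age_category", "handedness", "script_style")


def build_reference_db(writers):
    """Load (writer_id, age_category, handedness, script_style) rows
    into an in-memory SQLite database and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE writer ("
        "writer_id TEXT PRIMARY KEY, "
        "age_category TEXT, handedness TEXT, script_style TEXT)"
    )
    conn.executemany("INSERT INTO writer VALUES (?, ?, ?, ?)", writers)
    conn.commit()
    return conn


def candidate_writers(conn, **constraints):
    """Return the ids of writers matching every known constraint.

    Only whitelisted column names are accepted, so the column part of the
    query is never built from untrusted text; values are bound as parameters.
    """
    unknown = set(constraints) - set(ALLOWED_COLUMNS)
    if unknown:
        raise ValueError(f"unknown constraint(s): {sorted(unknown)}")
    sql = "SELECT writer_id FROM writer"
    if constraints:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in constraints)
    sql += " ORDER BY writer_id"
    rows = conn.execute(sql, tuple(constraints.values()))
    return [row[0] for row in rows]


# Illustrative, made-up reference set.
REFERENCE = [
    ("w001", "adult", "right", "cursive"),
    ("w002", "adult", "left", "cursive"),
    ("w003", "senior", "left", "block"),
    ("w004", "adult", "left", "block"),
]

conn = build_reference_db(REFERENCE)
reduced = candidate_writers(conn, handedness="left", script_style="cursive")
print(reduced)
```

Under these assumptions, only the writers that survive the constraint query would be passed on to the (much more expensive) comparison of script shapes.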