Offline handwriting segmentation for writer identification

Eduardo Cermeño, Silvana Mallor
Research Department, Vaelsys
Madrid, Spain
eduardo.cm@vaelsys.com, silvana.mh@vaelsys.com

Juan Alberto Sigüenza
Departamento de Informática, Universidad Autónoma de Madrid
Madrid, Spain
j.alberto.siguenza@uam.es

Abstract—In this paper we present a new technique for off-line, text-independent handwriting analysis based on segmentation. Segmentation is a common step in many research works, where it generates connected components that are then processed to extract features (geometry, concavity, etc.). Our work focuses on the segmentation process itself and on the information that can be extracted directly from the way a writer joins or separates connected ink components, without analyzing the components themselves. The proposed multi-segmentation method shows good results when tested on its own with real documents from a police corps database, and it suggests an improved way to apply segmentation in other systems based on connected components.

Keywords-handwriting, biometric, segmentation, police, security, connected components, image, machine learning

I. INTRODUCTION

Handwriting analysis is a common task in forensic practice. Finding out the authorship of a questioned document can provide valuable information for an investigation; this process is called writer identification. A second application, called writer verification, is used to provide testimony in court about the authorship of a document. In 2002, Srihari et al. [11] presented a study that used computers to validate the hypothesis that handwriting is individualistic. Human expertise is getting more and more support from computer-based tools that make this practice faster, more accurate and easier to scale. Writer verification can usually be addressed by human means, since the comparison involves only documents from one unknown person and documents from one known author.
On the other hand, writer identification may not be possible without computer support if the number of known authors becomes too large. Tapiador and Sigüenza [13] describe a method used by police experts to compare different questioned documents, which consists of classifying relevant characters by their shapes. Each document is then formulated using the results of the classification of its characters. Comparing these formulations makes document examination faster. Schomaker and Bulacu [8] present a method based on a similar concept, generating a codebook with connected components. Using connected components instead of characters is important because characters are difficult to segment properly.

In handwriting analysis, a connected component means a connected ink component. A writer can be considered a stochastic generator of connected components [8]. Depending on the writer, these components may encode a character fragment, a character, a word fragment or even a complete word. The segmentation procedure extracts connected components from a scanned document.

Schomaker [7] provides a classification of the different methods for automatic writer identification according to the extracted features. The two approaches with the best performance deal with (i) textural aspects, capturing pen-grip and pen-attitude preferences, and (ii) allographic elements, analyzing the shapes of characters or character fragments. The first approach, usually called global, is used by Said et al. [6], reaching up to 96% accuracy on a database of 10 writers with text-independent recognition. It is based on the observation that handwriting is visually distinctive, and it requires several image corrections. More recently, Bertolini et al. [1] reached even better results, 96.7%, on a database of 650 writers [5]. The second approach, usually called local, is implemented in [9], with results of up to 95% accuracy on a database of 10 writers.
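
As an illustration of the connected ink components discussed in this section, the following minimal Python sketch labels the components of a small binarized image. It is not the method proposed in this paper. The function names (`connected_components`, `bounding_box`), the choice of 8-connectivity and the toy `page` array are assumptions made only for this example; a real scanned document would first need to be binarized.

```python
from collections import deque


def connected_components(binary):
    """Return the 8-connected ink regions of a binary image.

    `binary` is a list of rows where 1 means ink and 0 means paper
    (an assumed convention for this sketch). Each component is
    returned as a list of (row, col) pixel coordinates.
    """
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of one component."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


# Toy "scanned page": three separate ink strokes.
page = [
    [1, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
components = connected_components(page)
boxes = [bounding_box(p) for p in components]
print(len(components), boxes)
```

With 8-connectivity, diagonally touching ink pixels belong to the same component; under 4-connectivity the diagonal stroke in the toy page would split into two components, which shows how the segmentation settings alone change what counts as a single component.
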
The local approach is the one chosen by most experts for manual document examination. According to the comparison of results presented in [1], it is hard to tell which approach is better for automatic writer identification, since both perform very well. In this work we focus on the segmentation phase. Bertolini et al. [1] state that the main bottlenecks