Towards a taxonomy of suspected forgery in authorship attribution field. A case: Montale’s Diario Postumo

Francesca Tomasi
Dept. of Classical Philology and Italian Studies, University of Bologna
via Zamboni 32, 40126 Bologna (Italy)
+390512098539
francesca.tomasi@unibo.it

Mirko Degli Esposti
Dept. of Mathematics, University of Bologna
Piazza di Porta S. Donato 5, 40126 Bologna (Italy)
+390512094409
mirko.degliesposti@unibo.it

Ilaria Bartolini
Dept. of Computer Science and Engineering, University of Bologna
Viale Risorgimento, 2, 40126 Bologna (Italy)
+390512093550
i.bartolini@unibo.it

Valentina Garulli
Dept. of Classical Philology and Italian Studies, University of Bologna
via Zamboni 32, 40126 Bologna (Italy)
+390512098529
valentina.garulli@unibo.it

Federico Condello
Dept. of Classical Philology and Italian Studies, University of Bologna
via Zamboni 32, 40126 Bologna (Italy)
+390512098539
federico.condello@unibo.it

Matteo Viale
Dept. of Classical Philology and Italian Studies, University of Bologna
via Zamboni 32, 40126 Bologna (Italy)
+390512098585
matteo.viale@unibo.it

ABSTRACT
This paper explores the quantitative and qualitative practices used in different scientific fields (philology, mathematics, quantitative linguistics, computer science) to reveal forgery. Our study is conducted on Montale’s Diario postumo, which shows all the typical features of a suspected forgery. The final aim is to merge these methods in order to define a taxonomy of annotation elements that is useful, in this particular context of authorship attribution, for developing a data model potentially applicable to all forgery situations.
Categories and Subject Descriptors
H.3 [Information Storage And Retrieval]: H.3.1 Content Analysis and Indexing; I.4 [Image Processing And Computer Vision]: I.4.7 Feature Measurement; I.5 [Pattern Recognition]: I.5.m Miscellaneous; I.7 [Document And Text Processing]: I.7.2 Document Preparation; J.5 [Arts And Humanities]: Linguistics, Literature.

Keywords
forgery, quantitative linguistics, image analysis, mathematics, philology, annotation, data model, TEI.

1. INTRODUCTION
The question of “what a text is” is not new. Different conceptions of the text imply different methods for managing an informational resource. A particularly intriguing aspect of the text, in the domain of authorship attribution (A.A.), concerns how to reveal forgery. Mathematicians, computer scientists, philologists, quantitative linguists and digital humanists have different points of view on what a text is, and this entails different strategies for revealing forgery. We argue that only constructive interaction between these approaches can help with complex problems such as forgery.

Philologists usually adopt qualitative and comparative methods, relying on phenomena such as anachronisms in events and language, inconsistencies in style, patchwork effects, and anomalies in the material medium or in the handwriting style. Computational methods are instead essentially statistical. Authenticity, dubious attribution, plagiarism and interpolation are typical subjects of stylometry and quantitative linguistics. Quantitative linguists, but also mathematicians, use two different approaches:
1) texts as character strings, regardless of their meaning (algorithmic approach);
2) texts as word sequences that have to be studied statistically (“bag of words” approach).

From the point of view of computer scientists, a text can also be represented, for example, by the image of the text itself (e.g. a manuscript page).
In this context, the application of pattern analysis and (dis)similarity search techniques (characterizing the handwriting of the page in terms of “low-level features”) could help in solving the problem of authorship attribution. However, similarity is not a satisfactory criterion for attributing authorship in the case of suspected forgery. It is not surprising that a forgery is similar to the author’s work; the problem is how to verify whether a text is too similar to the author’s work, and whether such types of similarity cannot be found elsewhere in the extant corpus.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
DH-case '13, September 10 2013, Florence, Italy
Copyright 2013 ACM 978-1-4503-2199-0/13/09…$15.00.
http://dx.doi.org/10.1145/2517978.2517989

Our first case study will be Montale’s Diario postumo, which shows all the typical features of a suspected forgery: first of all, an excess, rather than a lack, of textual similarities (single words, word groups, sentence patterns, etc.) with Montale’s authentic works (these similarities are mixed, of course, with many inconsistencies at the level of both style and meaning). This is