Author Identification using Writer-Dependent and Writer-Independent Strategies

Daniel Pavelec, Edson Justino, Leonardo V. Batista, and Luiz S. Oliveira

Pontifícia Universidade Católica do Paraná (PUCPR)
Programa de Pós-Graduação em Informática
{pavelec,justino,soares}@ppgia.pucpr.br

Federal University of Paraíba (UFPB)
Programa de Pós-Graduação em Informática
leonardo@di.ufpb.br

ABSTRACT
In this work we discuss author identification for documents written in Portuguese. Two different approaches are compared. The first is the writer-independent model, which reduces the pattern recognition problem to a single model with two classes and hence makes it possible to build a robust system even when few genuine samples per writer are available. The second is the personal model, which very often performs better but needs a larger number of samples per writer. We also introduce a stylometric feature set based on the conjunctions and adverbs of the Portuguese language. Experiments on a database of short articles from 30 different authors, using a Support Vector Machine (SVM) as classifier, demonstrate that the proposed strategy can produce results comparable to those in the literature.

Categories and Subject Descriptors
H.4 [Pattern Recognition]: Miscellaneous; D.2.8 [Doc. Engineering]: Stylometry - document analysis

Keywords
Author Identification, Stylometry

1. INTRODUCTION
There is a long history of linguistic and stylistic investigation into author identification, going back to the late nineteenth century, with the pioneering studies of Mendenhall [11] and Mascol [10] on distributions of sentence and word lengths in works of literature and the gospels of the New Testament. Modern work in author identification was initiated by Mosteller and Wallace in the 1960s, in their seminal study of The Federalist Papers [13].
All of these studies have been motivated by the fact that we usually leave indications of authorship in our writing, because each of us has a distinctive way of writing [12].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC'08 March 16-20, 2008, Fortaleza, Ceará, Brazil. Copyright 2008 ACM 978-1-59593-753-7/08/0003 ...$5.00.

In recent years, practical applications of author identification have grown in several areas, such as criminal law (identifying writers of ransom notes and harassing letters), civil law (copyright and estate disputes), and computer security (mining email content). Chaski [5] points out that in the investigation of certain crimes involving digital evidence, when a specific machine is identified as the source of documents, a legitimate issue is to identify the author who produced the documents; in other words, "Who was at the keyboard when the relevant documents were produced?".

In order to identify the author, one must extract the features that best represent the style of an author. In this context, stylometry (the application of the study of linguistic style) offers strong support for defining a discriminative feature set. The stylometric features applied in the literature include various measures of vocabulary richness and lexical repetition based on word frequency distributions. As observed by Madigan et al. [9], however, most of these measures depend strongly on the length of the text being studied and are therefore difficult to apply reliably.
Many other types of features have been tried, including word class frequencies [7, 1], syntactic analysis [3], word collocations [16], grammatical errors [8], and word, sentence, clause, and paragraph lengths [2].

To deal with the problem of author identification, a writer-specific model (also known as a personal model) is usually considered. It is based on two classes, ω1 and ω2, where ω1 represents authorship and ω2 represents forgery. The main drawbacks of the writer-specific approach are the need to train a new model each time an author is added to the system and the large number of genuine text samples necessary to build a reliable model.

An alternative to this strategy is the writer-independent approach. It uses the dissimilarity representation [14] and is called writer-independent because the number of models does not depend on the number of writers. It is a global model by nature: it reduces the pattern recognition problem to a single model with two classes and consequently makes it possible to build robust author identification systems even when few genuine samples per author are available.
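As a rough illustration of the dissimilarity representation behind the writer-independent approach, the following Python sketch turns pairs of texts into difference vectors labelled as same-author or different-author. It is only a sketch under stated assumptions: the word list `FUNCTION_WORDS`, the use of relative frequencies, the component-wise absolute difference, and the helper names are all hypothetical choices made for this example and are not taken from the paper.

```python
from itertools import combinations

# Hypothetical list of Portuguese conjunctions and adverbs (an assumption;
# the feature set actually used is not specified in this section).
FUNCTION_WORDS = ["e", "mas", "ou", "porque", "quando", "como", "muito", "sempre"]


def feature_vector(text):
    """Relative frequency of each listed word among whitespace-separated tokens."""
    tokens = text.lower().split()
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [tokens.count(word) / total for word in FUNCTION_WORDS]


def dissimilarity(u, v):
    """Component-wise absolute difference between two feature vectors."""
    return [abs(a - b) for a, b in zip(u, v)]


def build_pairs(samples):
    """Map (author, text) samples to dissimilarity vectors and two-class labels.

    Label 1 means both texts share an author, label 0 means they do not.
    """
    feats = [(author, feature_vector(text)) for author, text in samples]
    X, y = [], []
    for (a1, f1), (a2, f2) in combinations(feats, 2):
        X.append(dissimilarity(f1, f2))
        y.append(1 if a1 == a2 else 0)
    return X, y


# Toy data, invented for this example only.
samples = [
    ("A", "ele veio e ficou mas não falou"),
    ("A", "ela saiu e voltou mas não disse nada"),
    ("B", "sempre chove muito quando viajo"),
]
X, y = build_pairs(samples)
```

The resulting `X` and `y` form a single two-class training set, whatever the number of authors, and could be passed to a two-class classifier such as an SVM. A questioned text would then be compared against reference texts of a candidate author through the same `dissimilarity` function.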