Author Identification using Writer-Dependent and Writer-Independent Strategies

Daniel Pavelec†, Edson Justino†, Leonardo V. Batista‡, and Luiz S. Oliveira†

† Pontifícia Universidade Católica do Paraná (PUCPR), Programa de Pós-Graduação em Informática
{pavelec,justino,soares}@ppgia.pucpr.br

‡ Federal University of Paraíba (UFPB), Programa de Pós-Graduação em Informática
leonardo@di.ufpb.br

ABSTRACT
In this work we discuss author identification for documents written in Portuguese. Two different approaches are compared. The first is the writer-independent model, which reduces the pattern recognition problem to a single model and two classes and hence makes it possible to build a robust system even when few genuine samples per writer are available. The second is the personal model, which very often performs better but needs a larger number of samples per writer. We also introduce a stylometric feature set based on the conjunctions and adverbs of the Portuguese language. Experiments on a database of short articles from 30 different authors, using a Support Vector Machine (SVM) as classifier, demonstrate that the proposed strategy can produce results comparable to those reported in the literature.

Categories and Subject Descriptors
H.4 [Pattern Recognition]: Miscellaneous; D.2.8 [Doc. Engineering]: Stylometry—document analysis

Keywords
Author Identification, Stylometry

1. INTRODUCTION
There is a long history of linguistic and stylistic investigation into author identification, going back to the late nineteenth century and the pioneering studies of Mendenhall [11] and Mascol [10] on distributions of sentence and word lengths in works of literature and the gospels of the New Testament. Modern work in author identification began with Mosteller and Wallace in the 1960s, in their seminal study of The Federalist Papers [13].
All these have been motivated by the fact that we usually leave indications of authorship in our writings, because each of us has a distinctive way of writing [12].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC'08 March 16-20, 2008, Fortaleza, Ceará, Brazil
Copyright 2008 ACM 978-1-59593-753-7/08/0003 ...$5.00.

In recent years, practical applications for author identification have grown in several different areas, such as criminal law (identifying writers of ransom notes and harassing letters), civil law (copyright and estate disputes), and computer security (mining email content). Chaski [5] points out that in the investigation of certain crimes involving digital evidence, when a specific machine is identified as the source of documents, a legitimate issue is to identify the author who produced them; in other words, "Who was at the keyboard when the relevant documents were produced?"

In order to identify the author, one must extract the features that best represent the author's style. In this context, stylometry (the application of the study of linguistic style) offers strong support for defining a discriminative feature set. The stylometric features applied in the literature include various measures of vocabulary richness and lexical repetition based on word frequency distributions. As observed by Madigan et al. [9], however, most of these measures depend strongly on the length of the text being studied and are therefore difficult to apply reliably.
Many other types of features have been tried out, including word class frequencies [7, 1], syntactic analysis [3], word collocations [16], grammatical errors [8], and word, sentence, clause, and paragraph lengths [2].

To deal with the problem of author identification, a writer-specific model (also known as a personal model) is usually considered. It is based on two different classes, ω1 and ω2, where ω1 represents authorship while ω2 represents forgery. The main drawbacks of the writer-specific approach are the need to train a new model each time an author is added to the system and the large number of genuine text samples necessary to build a reliable model.

An alternative to this strategy is the writer-independent approach. It uses the dissimilarity representation [14] and is called writer-independent because the number of models does not depend on the number of writers. It is a global model by nature: it reduces the pattern recognition problem to a single model with two classes and, consequently, makes it possible to build robust author identification systems even when few genuine samples per author are available.
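
To make the writer-independent idea more concrete, the following is a minimal sketch of how a dissimilarity representation could turn a multi-author problem into a single two-class problem. It is an illustration under stated assumptions, not the method used in this paper: the element-wise absolute difference as dissimilarity measure, the function names dissimilarity and build_pairs, the "same author" / "different author" labelling, and the toy feature values are all hypothetical.

```python
from itertools import combinations


def dissimilarity(u, v):
    """Element-wise absolute difference between two feature vectors.

    Assumption: the absolute difference is one possible dissimilarity
    measure; the source text does not specify which one is used.
    """
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return [abs(a - b) for a, b in zip(u, v)]


def build_pairs(samples):
    """Turn (author, feature_vector) samples into two-class training data.

    Every unordered pair of documents yields one dissimilarity vector.
    The label is 1 when both documents share an author and 0 otherwise,
    so the number of classes is two regardless of the number of authors.
    """
    pairs = []
    for (author_a, vec_a), (author_b, vec_b) in combinations(samples, 2):
        label = 1 if author_a == author_b else 0
        pairs.append((dissimilarity(vec_a, vec_b), label))
    return pairs


# Hypothetical data: relative frequencies of three function words per document.
samples = [
    ("author_A", [0.10, 0.02, 0.05]),
    ("author_A", [0.11, 0.03, 0.05]),
    ("author_B", [0.04, 0.09, 0.01]),
    ("author_B", [0.05, 0.08, 0.02]),
]

training_set = build_pairs(samples)
n_same = sum(label for _, label in training_set)
n_diff = len(training_set) - n_same
print(len(training_set), n_same, n_diff)  # prints: 6 2 4
```

In this sketch, four documents from two authors produce six labelled dissimilarity vectors, which could then be passed to a single two-class classifier such as an SVM. Adding a third author would only add more pairs to the same training set; no additional model would be needed, which is the property the writer-independent approach relies on.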