Author Identification on the Large Scale

David Madigan (1,3), Alexander Genkin (1), David D. Lewis (2), Shlomo Argamon (4), Dmitriy Fradkin (1,5), and Li Ye (3)

(1) DIMACS, Rutgers University
(2) David D. Lewis Consulting
(3) Department of Statistics, Rutgers University
(4) Department of Computer Science, Illinois Institute of Technology
(5) Department of Computer Science, Rutgers University

1 Introduction

Individuals have distinctive ways of speaking and writing, and there exists a long history of linguistic and stylistic investigation into authorship attribution. In recent years, practical applications for authorship attribution have grown in areas such as intelligence (linking intercepted messages to each other and to known terrorists), criminal law (identifying writers of ransom notes and harassing letters), civil law (copyright and estate disputes), and computer security (tracking authors of computer virus source code). This activity is part of a broader growth within computer science of identification technologies, including biometrics (retinal scanning, speaker recognition, etc.), cryptographic signatures, intrusion detection systems, and others.

Automating authorship attribution promises more accurate results and objective measures of reliability, both of which are critical for legal and security applications. Recent research has used techniques from machine learning [3, 10, 13, 31, 50], multivariate and cluster analysis [24, 25, 8], and natural language processing [5, 46] in authorship attribution. These techniques have also been applied to related problems such as genre analysis [4, 1, 6, 17, 23, 46] and author profiling (such as by gender [2, 12] or personality [38]).

Our focus in this paper is on techniques for identifying authors in large collections of textual artifacts (e-mails, communiques, transcribed speech, etc.).
Our approach focuses on very high-dimensional, topic-free document representations and particular attribution problems, such as:
(1) Which one of these K authors wrote this particular document?
(2) Did any of these K authors write this particular document?

Scientific investigation into measuring style and authorship of texts goes back to the late nineteenth century, with the pioneering studies of Mendenhall [36] and Mascol [34, 35] on distributions of sentence and word lengths in works of literature and the gospels of the New Testament. The underlying notion was that works by different authors are strongly distinguished by quantifiable features of the text. By the mid-twentieth century, this line of research had grown into what became known as “stylometrics”, and a variety of textual statistics had been proposed to quantify textual style. The style of early work was characterized by a search for invariant properties of textual statistics, such as Zipf’s distribution and Yule’s K statistic.