Document Author Classification using Generalized Discriminant Analysis

Todd K. Moon, Peg Howland, Jacob H. Gunther
Utah State University

Abstract

Classification by document authorship based on statistical analysis (stylometry) is considered here using feature vectors obtained from counts of all words in the intersecting sets of the training data. This differs from some previous stylometry, which used only selected “noncontextual” words with the highest counts, and also from conventional text search techniques, where noncontextual words are frequently left out when the term-by-document matrices are formed. The dimensionality of the resulting vector is reduced using a generalized discriminant analysis (GDA). The method is tested on three sets of documents which have been previously subjected to statistical analysis. Results show that the method is successful at identifying author differences and at classifying unknown authorship, consistent with previous techniques.

Keywords: author identification; LDA/GSVD; stylometry.

1 Introduction and Background

It has been suggested (see, e.g., [1, 2]) that authors leave tell-tale footprints in their writings indicative of authorship, which can be revealed by an appropriate statistical analysis. Following [1], we refer to such methods of authorship study as stylometry, or stylometric analysis. Stylometry is based on the assumptions that authors unconsciously use some word patterns in a manner more or less consistent across documents and across time and that, because the use of these words is unconscious, even imitators can be distinguished from the authors they would imitate. Extensive testing of stylometric analysis on works by various authors has provided at least partial validation of the underlying assumptions. For example, Sir Walter Scott showed little statistical variation in his style, even over a career interrupted by five strokes [1, Chapter 10].
And in a series of statistical tests, the author Robert Heinlein’s signature uniquely showed through even when he was writing as two different narrators in The Number of the Beast [3, p. 106].

Statistical analysis of documents goes back to Augustus de Morgan in 1851 [4, p. 282], [1, p. 166], who proposed that word length statistics might be used to determine the authorship of the Pauline epistles. Since that initial proposal (not actually carried out by de Morgan), the Bible has been subjected to extensive statistical scrutiny, with many of the studies reaching conflicting conclusions. Stylometry was also employed as early as 1901 to explore the authorship of Shakespeare [5]. Since then, it has been employed in a variety of literary studies (see, e.g., [6, 7, 8]), including studies of twelve of The Federalist papers that were of uncertain authorship [9] (which we re-examine here) and of an unfinished novel by Jane Austen (which we also re-examine here). Information theoretic techniques have also recently been used [10].

Stylometry is usually based on “noncontextual words,” words which do not convey the primary meaning of the text, but which act in the background of the text to provide structure and flow. The use of noncontextual words is at least plausible, since an author may address a variety of topics, so particular distinguishing words are not necessarily revealing of authorship. As stated in [11]:

    The noncontextual words which have been most successful in discriminating among authors are the filler words of the language such as prepositions and conjunctions, and sometimes adjectives and adverbs. Authors differ in their rates of usage of these filler words.

(However, statistical analysis based on author vocabulary size vs. document length, the “vocabulary richness,” has also been explored [12].)
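The idea that authors differ in their rates of usage of filler words can be illustrated with a minimal sketch. It is not the procedure of any of the studies cited here: the word list `FILLER_WORDS`, the simple regular-expression tokenizer, and the normalization to occurrences per 1000 tokens are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical filler-word list, chosen only for illustration;
# it is not taken from any of the cited studies.
FILLER_WORDS = ("the", "of", "and", "to", "in", "upon", "by")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def filler_rates(text, words=FILLER_WORDS, per=1000):
    """Return each filler word's usage rate per `per` tokens of text."""
    tokens = tokenize(text)
    if not tokens:
        # An empty document has no defined rates; report zeros.
        return {w: 0.0 for w in words}
    counts = Counter(tokens)  # missing words count as 0
    return {w: per * counts[w] / len(tokens) for w in words}
```

Calling `filler_rates` on two documents gives two vectors over the same word list, which could then be compared directly; normalizing by document length is one way to keep documents of different sizes comparable.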
In noncontextual word studies, a restricted set of “most common” words is selected [1], and documents are represented by word counts, or ratios of word counts to document length. As a variation, sets of ratios of counts of noncontextual word patterns to other word patterns are also employed [3]. However, it has largely been a matter of investigator choice which words are selected as noncontextual, opening the stylometric analysis to criticisms of nonobjectivity.

In this work, we examine all of the words in the intersection of the documents in question. This results in a higher dimensional space than has been conventional. The dimensionality is handled, however, using a gener-