Automatic Source Attribution of Text: A Neural Networks Approach

Foaad Khosmood and Franz Kurfess, Ph.D.
Department of Computer Science
California Polytechnic State University
San Luis Obispo, CA 93407
foaadk@yahoo.com, fkurfess@calpoly.edu

ABSTRACT

Recent advances in automatic authorship attribution have been promising. Relatively new techniques such as N-gram analysis have shown important improvements in accuracy [2]. Much of the work in this area, however, remains in the realm of statistics, best suited for human assistance rather than autonomous attribution [6]. While there have been attempts at using neural networks in this area in the past, they have been extremely limited and problem-specific [7]. This paper addresses the latter points by demonstrating a practical and truly autonomous attribution process using neural networks. Furthermore, we use a word-frequency classification technique to demonstrate the feasibility of this process in particular and the applicability of neural networks to textual analysis in general.

Key Words: neural networks, computational linguistics, authorship attribution, source attribution.

I. INTRODUCTION

We define automatic source attribution as the ability of an autonomous process to determine the source of a previously unexamined piece of text. A software system designed to follow such a process would analyze a set of input corpora and construct a neural network to perform attribution. It would then train the network on the corpora, apply the sample texts, and determine attribution. For our source recognition problem, our system constructs a 5-layer, 420 million-connection neural network. It is able to correctly attribute sample texts previously unexamined by the system. Specifically, we conduct three sets of experiments to test the ability of the system: broad categorization, narrow categorization, and minimal-sample categorization.
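The pipeline described above (digest known corpora, build a classifier over word-frequency features, then attribute an unseen sample) can be illustrated with a minimal standard-library sketch. This is not the paper's system: it substitutes a simple nearest-profile cosine comparison for the 5-layer neural network, and the corpora, source names, and function names are all illustrative assumptions.

```python
from collections import Counter
import math
import re

def word_frequencies(text):
    """Relative word frequencies of a text: the feature vector used here
    as a stand-in for the paper's word-frequency inputs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def attribute(sample, corpora):
    """Attribute `sample` to the known source whose frequency profile
    it most resembles (a trivial proxy for the trained network)."""
    sample_vec = word_frequencies(sample)
    profiles = {src: word_frequencies(text) for src, text in corpora.items()}
    return max(profiles, key=lambda src: cosine(sample_vec, profiles[src]))
```

For example, given two toy corpora `{"legal": "the court finds the defendant liable under the statute", "sports": "the team scored late in the game to win the match"}`, the sample "the defendant appealed the court ruling" is attributed to "legal". A neural network replaces the cosine comparison with a learned, nonlinear decision over the same kind of feature vector.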
An automatic source attribution system must be able to digest a set of text corpora with known sources in order to determine the source or literary originator of a new piece of writing. The word "automatic" is meant to emphasize the desired absence of human intervention in the attribution process. Most of the work in source or authorship attribution is currently done with heavy involvement of humans. Even many computerized or statistical methods serve merely as assistants to human decision makers [6], who still to some extent subjectively evaluate the writing. This is true of almost all the famous authorship attribution cases. For example, statistical methods were used to assist human specialists in the dispute over The Federalist Papers (some papers being claimed by both Hamilton and Madison) [7]. We build on previous experience to make ours a problem-independent autonomous system.

II. SOURCE VERSUS AUTHORSHIP

Many previous works within the field of computer science refer to this area as "authorship attribution." For a variety of reasons, we believe "source attribution" is a more accurate description of our experiments. The works of different individuals can appear together as part of the same unit, with the same style and linguistic distinction. Associated Press news stories, for example, may be written by several different individuals, but they all adhere to the same established writing style and may report on the same subject or even the same incident. The Bible and technical manuals are also examples of distinctive "sources." There are thus multiple factors that constitute "source." Two of the most important are originator and subject matter. It is important to note that each of these spheres of contribution has a shifting scope that depends on the other sources it is being distinguished from. Originator, for example, could mean "Shakespeare," "British author," or "English language author," depending on what else it is being compared to.
Similarly, subject could be relatively narrow, such as "US Foreign Policy in Latin America 1999-2000," or broad, like "Love" or "Life."