A Comparative Study of Human Coding and Context Analysis against Support Vector Machines (SVM) to Differentiate Campaign Emails by Party and Issues

Examining Narrowcasting in the 2004 Presidential Election

By Stephen Purpura (Harvard University), Dustin Hillard (University of Washington), and Dr. Philip Howard (University of Washington)

December 6, 2004 -- WORKING DRAFT
Last Updated: January 7, 2006

1 Introduction

This work investigates the classification of emails sent by the political parties during the 2004 presidential election. Given an email without contextual information, we classify it as originating from either the Republican or the Democratic Party. While this type of task is well practiced in the political communication literature, we use this exercise to demonstrate a new methodological technique that bridges the qualitative world of coding and context analysis with empirical analysis and methods.[1]

Our experiment involves two parallel studies using the same data set and coding rules. The first study is a traditional context analysis experiment conducted by Dr. Philip Howard at the University of Washington. The second is a computer-assisted context analysis conducted by Dustin Hillard and Stephen Purpura. The focus of this paper is to describe how a skilled computer scientist would approach the problem of categorizing thousands of email messages.

Text categorization problems are frequently encountered by political communication analysts, and current methods employ manual techniques or computer software that searches for keywords in the text. While our proposed methods do not replace these techniques, we identify a new tool that arms the social science researcher with the fruits of advancing research in computational linguistic algorithms. To bridge the gap between qualitative political communication analysis and research from computational linguistics, our work relies heavily on the computational linguistics literature.
[1] The authors wish to thank our reviewers: Dr. James Purpura (Columbia University), Dan Hopkins (Harvard University), and Dr. John Wilkerson (University of Washington).

Context analysis in political communication has goals similar to those of topic spotting in newswire data. Numerous techniques for topic classification have been well documented. In this work, we chose support vector machines (SVMs) because of their relatively strong performance on a wide variety of tasks. Support vector machines are a natural fit for topic classification because they deal well with sparse data and high dimensionality. Our only departure from most standard topic classification research is that campaign emails have different language patterns and characteristics from the news stories or broadcasts typically used in topic classification. Unlike news stories or broadcasts, campaign emails are repetitive and very similar to one another. This can make the task of