Applied Text Analytics for Comments on News-Articles
A Bachelor Thesis

Anne Schuth
aschuth@science.uva.nl

ABSTRACT
Several on-line daily newspapers offer readers the opportunity to comment directly on articles. In the Netherlands this feature is used quite often and the quality (grammatically and content-wise) is surprisingly high. This paper develops techniques to collect, store, enrich and analyze these comments. After giving a high-level overview of the Dutch ‘commentosphere’ we zoom in on extracting the discussion structure found in flat comment threads: people not only comment on the news article, they also comment heavily on other comments, so that threads resemble discussion fora. We show how techniques from information retrieval, natural language processing and machine learning can be used to extract the ‘reacts-on’ relation between comments with remarkably high precision and recall.

Categories and Subject Descriptors
H.3.1 [Information Systems]: Information Storage and Retrieval - Content Analysis and Indexing; H.4.3 [Information Systems]: Information Systems Applications - Communications Applications

General Terms
Algorithms, Experimentation

Keywords
web mining, web data extraction

1. INTRODUCTION
In the past few years, the World Wide Web has seen some fundamental changes, of which the phenomenon known as user-generated content is the most striking: users of websites, as opposed to their owners, are now able to add their own content. Probably the most extensively used form of user-generated content is the blog, but recently several daily newspapers have also started offering their readers the possibility to comment on articles. These comments contain a great deal of high-quality discussion and debate, and could therefore add substantially to the value and appeal of the on-line version of news articles. The liveliness of a discussion following an article, for example, might indicate how hot a subject is.
Also, the way people react to each other might reflect the social debate that is going on about subjects similar to that of an article. For these reasons, and probably more, it is interesting to gain insight into the ‘commentosphere’. This paper aims to provide this insight in a way similar to what was done by G. Mishne [1], albeit on a much smaller scale.

Furthermore, on all of the news sites it is presently not possible for someone posting a comment to specify whether he or she is commenting on the article or on a previous comment by another author. The effect is that discussions are often quite confusing and hard to read. It would be very helpful to bring out the underlying structure and to present it within the comment thread, for example by linking each comment to the posts it reacts to.

So, besides being of interest to social scientists, investigating the ‘commentosphere’ could also lead to enrichment of data already present, and new applications could be based on this additional data.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8–12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

1.1 Research Questions
As mentioned above, the results of the analytics used and the methods developed in this paper could be of considerable interest to social scientists. The research question guiding this paper therefore splits into two questions. One aims to provide insight into the characteristics of the ‘commentosphere’; the other focuses on developing methods to arrive at those insights.

1. The first question, of interest to social scientists, is: What does the ‘commentosphere’ consist of and what does it look like? This is an umbrella for questions like: Who is posting comments? Where do commenters come from? Do they react to each other? In what way do they react? What does a comment look like? What kind of language is used?

2. Of more interest to information science is the second research question: How can properties of the ‘commentosphere’ be brought to light? Again, this question is broad, and answering it would involve at least answering the following questions: How can text analytics methods be applied to comments? How can the person behind a comment be identified? How can the structure of comment threads be made explicit?

1.2 Organization of the paper
We first describe the data ‘as-it-is’ and how it is acquired, in Section 2. We then apply some text analytics to the data, in Section 3, giving us partial answers to both research questions. Then Section 4 goes on with