Automated opinion detection: Implications of the level of agreement between human raters

Deanna Osman *, John Yearwood, Peter Vamplew

Data Mining and Informatics Research Group (DMIRG), Centre for Informatics and Applied Optimization (CIAO), Graduate School of Information Technology and Mathematical Sciences, University of Ballarat, P.O. Box 663, Ballarat, Victoria 3353, Australia

Article history: Received 24 June 2008; Received in revised form 18 August 2009; Accepted 20 August 2009; Available online 2 October 2009

Keywords: Opinion detection; Blog06; Inter-rater agreement

Abstract

The ability to agree with the TREC Blog06 opinion assessments was measured for seven human assessors and compared with the submitted results of the Blog06 participants. The assessors achieved a fair level of agreement between their assessments, although the range between assessors was large. It is recommended that multiple assessors be used to assess opinion data, or that assessors be pre-tested so that the most dissenting assessors can be removed from the pool prior to the assessment process. The possibility of inconsistent assessments in a corpus also raises concerns about training data for an automated opinion detection system (AODS), so a further recommendation is that AODS training data be assembled from a variety of sources. This paper establishes an aspirational value for an AODS by determining the level of agreement achievable by human assessors when assessing the existence of an opinion on a given topic. Knowing the level of agreement amongst humans is important because it sets an upper bound on the expected performance of an AODS. While the AODSs surveyed achieved satisfactory results, none achieved a result close to the upper bound.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years there has been growing interest in the detection of opinions in online documents. One source of online documents is web logs (blogs).
A blog tracking company, Technorati, Inc., reported a dramatic rise in the number of blogs, tracking 112.8 million blogs worldwide (About technorati, 2007), up from 4.2 million in October 2004 (Rosenbloom, 2004), which equates to more than 2500% growth.

In 2006, the Text Retrieval Conference (TREC) released the Blog06 document collection, which included an opinion detection task. Past opinion detection research has reported the detection of an opinion within a document without specifying the topic of the opinion (Kim & Hovy, 2005, 2006; Yu & Hatzivassiloglou, 2003), whereas the Blog06 opinion detection task differs by asking participants to detect an opinion about a given topic. Detecting an opinion about a given topic complicates the task because a blog document usually covers multiple topics, some of which may express an opinion about a topic other than the one of interest (Oard et al., 2006; Yang, Yu, Valerio, & Zhang, 2006).

The lists of documents identified by the Blog06 participants as expressing an opinion on a given topic were assessed by TREC assessors. A random selection of 100 blog documents, drawn from the documents assessed as being relevant to topics within the Blog06 corpus, is used as a standard assessment in this study. These assessments are compared to the assessments of seven independent human assessors to determine the level of agreement between multiple assessors.

Information Processing and Management 46 (2010) 331–342. doi:10.1016/j.ipm.2009.08.005.

* Corresponding author. Tel.: +61 3 5327 9184; fax: +61 3 5327 9704. E-mail addresses: d.osman@ballarat.edu.au (D. Osman), j.yearwood@ballarat.edu.au (J. Yearwood), p.vamplew@ballarat.edu.au (P. Vamplew).