Constructing a Scientific Blog Corpus for Information Credibility Analysis

Eric Nichols, Koji Murakami, Kentaro Inui, and Yuji Matsumoto
Computational Linguistics Laboratory, Graduate School of Information Science,
Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara, 630-0192 JAPAN
{eric-n,kmurakami,inui,matsu}@is.naist.jp

Abstract

In this paper we discuss the construction of an English corpus for use in evaluating the credibility of information on the Web. Because the identification of conflicting opinions and their logical justifications is of great importance, we turn to scientific blogs as our primary source of data. By exploiting blog post metadata and the network structure of the scientific blogging community, we minimize construction costs by automating many of the tasks necessary for data collection and annotation. We propose a technique for gathering blog posts into multi-document discussions that share a common source of interest and evaluate a filtering method for reducing noise in the corpus data.

1 Motivation

The importance of the internet as a source of information cannot be disputed. A recent poll (Pew Research, 2008) found that among Americans the internet has overtaken newspapers as a news outlet and rivaled television for those surveyed under the age of thirty. In this age of widespread internet access, anyone with a computer can put their ideas on the Web where they can be viewed by a large audience. However, this removal of the barriers to publication has also made it easier to spread false information.

1.1 The Anti-vax Movement: A Cautionary Tale

The anti-vaccination movement (hereafter "the anti-vax movement") is a good example of the danger of misinformation. In 1998, a group of researchers in the UK published a study implying a causal connection between Measles, Mumps, and Rubella (MMR) vaccinations and the development of autism in children (Wakefield et al., 1998).
Though further scrutiny of these initial results disproved the autism-vaccination link, culminating in the withdrawal of endorsements by 10 of the study's 12 co-authors, the damage had already been done.

The mainstream media picked up on the study, amplifying fears about the safety of vaccinations in an already nervous public. An anti-vaccination movement soon formed, fueled by celebrity activists. Online communities¹ developed, insulating their members against the medical evidence to the contrary. Vaccination rates plummeted despite the best efforts of public health organizations (Finding Dulcenia, 2009).

As a result of the spread of the anti-vax movement, 2008 saw, for the first time in over a decade, a resurgence in the number of reported cases of measles in both the United States (CDC, 2008) and Europe. The situation in the UK was serious enough that measles was declared endemic there (Eurosurveillance, 2008). Measles, which in the 1990s was considered a cured disease, was making a comeback.

1.2 The Importance of Evaluating Credibility

The measles resurgence caused by the anti-vax movement is tragic, but it could have been prevented. After all, the study of Wakefield et al. (1998) was repeated numerous times in an attempt to verify the connection between MMR vaccinations and autism, and the results were overwhelmingly against such a causative connection². But this information did not reach the very people concerned about the safety of vaccinations. Part of the blame belongs with the mainstream media, which, both online and offline, was more interested in entertaining conspiracy theories than in presenting the wealth of evidence disproving a vaccination-autism link. The underlying problem, however, is that people did not know how to find trustworthy evidence to the contrary, which illustrates the need to assist the evaluation of information credibility.
¹ http://www.ageofautism.com
² A regularly updated list of studies can be found at http://en.wikipedia.org/wiki/MMR_vaccine_controversy