arXiv:1611.06947v1 [cs.SI] 21 Nov 2016

Social Media as a Sensor for Censorship Detection in News Media

Rongrong Tao 1, Baojian Zhou 2, Adil Alim 2, Feng Chen 2, David Mares 3, Patrick Butler 1, Naren Ramakrishnan 1

1 Discovery Analytics Center, Department of Computer Science, Virginia Tech, Arlington, VA, USA
2 Department of Computer Science, University at Albany, SUNY, Albany, NY, USA
3 University of California at San Diego, San Diego, CA, USA

rrtao@vt.edu, {bzhou6, aalimu, fchen5}@albany.edu, dmares@ucsd.edu, pabutler@vt.edu, naren@cs.vt.edu

Abstract

Censorship in social media has been well studied and provides insight into how governments stifle freedom of expression online. Comparatively little (or no) attention has been paid to detecting censorship in traditional media (e.g., news) using social media as a bellwether. We present a novel unsupervised approach that views social media as a sensor to detect censorship in news media, wherein statistically significant differences between information published in the news media and the correlated information published in social media are automatically identified as candidate censored events. We develop a hypothesis testing framework to identify and evaluate censored clusters of keywords, and a new near-linear-time algorithm (called GraphDPD) to identify the highest scoring clusters as indicators of censorship. We outline extensive experiments on semi-synthetic data as well as real datasets (with Twitter and local news media) from Mexico and Venezuela, highlighting the capability to accurately detect real-world censorship events.

1. INTRODUCTION

News media censorship is generally defined as a restriction on freedom of speech that prohibits access to public information, and it is taking place more than ever before. According to the Freedom of the Press Report, 40.4 percent of nations fit into the “free” category in 2003. By 2014, this global percentage had fallen to 32 percent [2], as shown in Figure 1.
More than 200 journalists were jailed in 2014, according to the Committee to Protect Journalists; in fact, more than 200 journalists have been jailed annually over the past three years [1].

Figure 1: Worldwide freedom of the press (2014) [2]. The higher the score, the worse the press freedom status.

Although the social and political aspects of news media censorship have been extensively discussed and analyzed in the social sciences [13, 29, 27], there is currently no efficient and effective approach to automatically detect and track such censorship events in real time. Unlike Internet censorship detection, in which labeled data (e.g., deleted posts or blogs on social media websites) can be collected to support supervised learning [14, 31], the detection of censorship in news media often has no labeled data available for training and must rely on unsupervised techniques instead.

In this paper, we present a novel unsupervised approach that views social media as a sensor to detect censorship in news media, wherein statistically significant differences between information published in the news media and the correlated information published in social media are automatically identified as candidate censored events. A generalized log-likelihood ratio test (GLRT) statistic can then be formulated for hypothesis testing, and the problem of censorship detection can be cast as the maximization of the GLRT statistic over all possible clusters of keywords. We propose a near-linear-time algorithm called GraphDPD to identify the highest scoring clusters as indicators of censorship events in the local news media, and further apply randomization testing to estimate the statistical significance of these clusters.

We consider the detection of censorship in the news media of two countries, Mexico and Venezuela, and utilize Twitter as the uncensored source.
In January 2012, Twitter launched a “Country-Withheld Content” policy, under which governments are able to request the withholding and deletion of user accounts and tweets [12]. At the same time, Twitter started to release a transparency report, which provides worldwide information and removal requests for user accounts and tweets [7]. The Transparency Report lists information and removal requests from 2012 to 2015 on a half-yearly basis. Table 1 summarizes the information and removal requests in 2014 for nine countries of interest. As shown in Table 1, although all of these countries have issued account information requests, most of them did not intend