Crawling JavaScript websites using WebKit – with application to analysis of hate speech in online discussions
Hugo Hammer 1, Alfred Bratterud and Siri Fagernes
Oslo and Akershus University College of Applied Sciences, Institute of Information Technology
Abstract
JavaScript client-side hidden web (CSHW) pages contain dynamic material created as a result of specific user activities. The number of CSHW websites is increasing. Crawling the so-called Hidden Web is challenging, particularly when JavaScript CSHW content from an external website is seamlessly included as part of the web pages. We have developed a prototype web crawler that efficiently extracts content from CSHW pages. The crawler uses WebKit to render web pages and to emulate human activity on a page in order to reveal dynamic content. The WebKit crawler was used to collect text from 39 Norwegian online newspaper debate articles, where the online user discussions were included as JavaScript CSHW content from other websites. The average speeds for extracting the main content and the JavaScript-generated discussions were 36.3 kB/sec and 8.8 kB/sec, respectively. Opinion mining of the collected text shows that the debate articles are more positive toward Islam and Muslims than the discussions that follow them. The results demonstrate the importance of being able to collect such JavaScript CSHW discussion content to get an overview of existing hate speech on the Internet.
1 Introduction
Over the past years there has been an alarming growth in hate against minorities like Muslims, Jews, gypsies and gays in Europe, driven by right-wing populist parties and extremist organizations [12] [28]. A similar increase in hate speech has been observed on the Internet [14] [4], and experts are concerned that people influenced by this web content may resort to violence as a result [24] [25].
Social media and online discussions contain a wealth of information that can help us understand the extent of hate speech on the Internet and the risk of violence it may cause. However, academic research on social media and online radicalization is lacking [26]. Automatic analysis of social media has recently received much attention in other areas as well. Companies can use information extracted from social media to better understand consumers' attitudes to their products or to target online marketing toward each customer. In recent years such analysis has become a large industry with major international actors like IBM and SAS Institute [8] [15]. Journalists can use social media analysis to quickly identify trends or rumors, or to see the public response to world events. Social media analysis relies on collecting and analyzing text from online discussions. Software programs that traverse the Internet following hypertext links to collect web
1 hugo.hammer@hioa.no