hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene

Nikola Ljubešić¹ and Tomaž Erjavec²

¹ Faculty of Humanities and Social Sciences, University of Zagreb, Croatia
nikola.ljubesic@ffzg.hr
² Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
tomaz.erjavec@ijs.si

Abstract. Web corpora have become an attractive source of linguistic content, yet are for many languages still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a modified standard “Web as Corpus” pipeline having in mind the limited amount of available web data. The modifications are described in the paper, focusing on the content extraction from HTML pages, which combines high precision of extracted language content with a decent recall. The paper also investigates text-types of the acquired corpora using topic modeling, comparing the two corpora among themselves and with ukWaC.

Keywords: web corpus, Croatian, Slovene, topic modeling

1 Introduction

With the advent of the web, a vast new source of linguistic information has emerged. The exploitation of this resource has especially gained momentum with the WaCky initiative [1], which has popularised the concept of “Web as Corpus”. It has also made available tools for compiling such corpora and produced large WaC corpora for a number of major European languages. Now such corpora are also being built for the so called smaller languages, such as Norwegian [2] and Czech [3], moving the concept of a “large corpus” for smaller languages up to the 1 billion token frontier.

As Web corpus acquisition is much less controlled than that for traditional corpora, the necessity of analyzing their content gains in significance. The linguistic quality of the content is mostly explored through word lists and collocates [1] while the content itself is explored using unsupervised methods, such as clustering and topic modeling [4].
2 Building the hrWaC and slWaC

The standard pipeline for building web corpora was developed primarily for languages where the amount of web data is orders of magnitude larger than the corpus being built. On the other hand, smaller languages cannot afford the