The WeSearch Corpus, Treebank, and Treecache:
A Comprehensive Sample of User-Generated Content

Jonathon Read, Dan Flickinger, Rebecca Dridan, Stephan Oepen, and Lilja Øvrelid
University of Oslo, Department of Informatics
Stanford University, Center for the Study of Language and Information
{jread | rdridan | oe | liljao}@ifi.uio.no, danf@stanford.edu

Abstract

We present the WeSearch Data Collection (WDC), a freely redistributable, partly annotated, comprehensive sample of User-Generated Content. The WDC contains data extracted from a range of genres of varying formality (user forums, product review sites, blogs, and Wikipedia) and covers two different domains (NLP and Linux). In this article, we describe the data selection and extraction process, with a focus on the extraction of linguistic content from different sources. We present the format of the syntacto-semantic annotations found in this resource and report initial parsing results for these data, as well as some reflections following a first round of treebanking.

Keywords: User-Generated Content, Open-Source Corpus, Manually Validated Treebank, Automatically Created Treecache

1 Background and Motivation

An ever-increasing proportion of the Internet is comprised of so-called User-Generated Content (UGC). Applications that seek to ‘tap into the wisdom of the masses’ demand various levels of natural language processing (NLP) of these kinds of text. For statistical parsing, for example, Foster et al. (2011) observe that common off-the-shelf parsers, trained on the venerable Wall Street Journal data, perform between ten and twenty F1 points worse when applied to social media data. To enable more R&D into the linguistic properties of common types of UGC, as well as into the technological challenges it presents for NLP, we are making available a freely redistributable, partly annotated, comprehensive sample of UGC: the WeSearch Data Collection (WDC).
The term ‘domain adaptation’ has at times been used to characterise the problem of tuning NLP tools for specific types of input (Plank, 2011). In our view, however, it is desirable to reserve the term ‘domain’ for content properties of natural language samples (i.e. the subject matter), and to complement it with the notion of ‘genre’ for formal properties of language data (i.e. the text type). On this view, parser adaptation would typically comprise both domain and genre adaptation (and possibly other dimensions). Thus, in the WDC resource discussed below, we carefully try to tease the two dimensions of variation apart, seeking to enable systematic experimentation along either of the two dimensions, or both.

In this work, we develop a large, carefully curated sample of UGC comprising three components: the WDC Corpus, Treebank, and Treecache. Here, the corpus comprises the unannotated but utterance-segmented text (at variable levels of ‘purification’); the treebank provides fine-grained, gold-standard syntactic and semantic analyses; and the treecache is built from automatically constructed (i.e. not manually validated) syntacto-semantic annotations in the same format. [1]

[1] In recent literature, the term ‘treebank’ is at times used to refer to automatically parsed corpora. To maintain a clear distinction between validated, gold-standard and non-validated annotations, we coin the parallel term ‘treecache’ to refer to automatically created collections of valuable, if not fully gold-standard, trees. Note that this notion is related to what Riezler et al. (2000), in the context of Lexical Functional Grammar, dub a ‘parsebank’, though not fully identical to the interpretation of Rosén et al. (2009) of that term (also within the LFG framework).

The article is structured as follows: Section 2 describes the selection of sources for the data collection, and Section 3 goes on to detail the harvesting and extraction of content from these data sources. In Section 4 we describe the organisation of the corpus into three versions with different formats, as well as its organisation with respect to genre and domain and its standardised train-test splits. Moving on to annotation, Section 5 presents the annotation format for the data collection, while Section 6 reports initial parsing results for the full data collection and Section 7 provides some reflections regarding quality and domain- and genre-specific properties of the data following an initial round of treebanking. Finally, Section 8 details next steps in terms of corpus refinement and the ultimate release of the resource.

2 Data Selection

When selecting data for our corpus, we are firstly interested in a variety of registers of user-generated content (i.e. genres, in our sense) that represent a range of linguistic formality. To date, we have therefore obtained text from user forums, product review sites, blogs, and Wikipedia. Albeit ‘user-generated’ only by a stretch, future versions of the corpus will also include open-access research literature. Secondly, we acquired text from sources that discuss either the Linux operating system or natural language processing. The choice of these domains is motivated by our assumption that the users of the corpus will be more familiar with the language used in connection with these topics than with (for example) that used in the biomedical domain.

Table 1 lists the complete set of data sources for the first public release of the WDC. [2] The selection reflects linguistic variation (ranging from the formal, edited language of Wikipedia and blogs, to the more dynamic and informal

[2] See www.delph-in.net/wesearch for technical details and download instructions.
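The two-dimensional organisation described above (genre crossed with domain) can be illustrated with a minimal sketch. The record layout, field names, and `subset` helper below are hypothetical, for illustration only, and do not reflect the actual WDC release format; the point is simply that experiments can hold one dimension fixed while varying the other.

```python
# Hypothetical sketch of genre x domain corpus organisation.
# Field names and values are illustrative, not the WDC format.
GENRES = ("forum", "review", "blog", "wikipedia")
DOMAINS = ("linux", "nlp")

documents = [
    {"id": 1, "genre": "forum", "domain": "linux", "text": "..."},
    {"id": 2, "genre": "blog", "domain": "nlp", "text": "..."},
    {"id": 3, "genre": "wikipedia", "domain": "linux", "text": "..."},
]

def subset(docs, genre=None, domain=None):
    """Select documents matching the given genre and/or domain;
    a dimension left as None is unconstrained."""
    return [d for d in docs
            if (genre is None or d["genre"] == genre)
            and (domain is None or d["domain"] == domain)]

# Genre adaptation experiments: fix the domain, vary the genre.
linux_docs = subset(documents, domain="linux")
# Domain adaptation experiments: fix the genre, vary the domain.
forum_docs = subset(documents, genre="forum")
```

Holding one dimension constant in this way is what lets parser-adaptation effects be attributed to genre or to domain separately, rather than to both at once.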