Parsing with Subdomain Instance Weighting from Raw Corpora

Barbara Plank 1, Khalil Sima'an 2
1 Alfa informatica, Faculty of Arts, University of Groningen, The Netherlands
2 Language and Computation, Faculty of Science, University of Amsterdam, The Netherlands
b.plank@rug.nl, k.simaan@uva.nl

Abstract

The treebanks that are used for training statistical parsers consist of hand-parsed sentences from a single source/domain like newspaper text. However, newspaper text concerns different subdomains of language use (e.g. finance, sports, politics, music), which implies that the statistics gathered by generative statistical parsers are averages over subdomain statistics. In this paper we explore a method, subdomain instance-weighting, that exploits raw subdomain corpora for introducing subdomain statistics into a state-of-the-art generative parser. We employ instance-weighting for creating an ensemble of subdomain-specific versions of the parser, and explore methods for amalgamating their predictions. Our experiments show that subdomain statistics extracted from raw corpora can even improve the quality of the n-best lists of a formidable, state-of-the-art parser.

Index Terms: statistical parsing, adaptation, subdomains, instance weighting

1. Motivation

Generative models for statistical parsing are currently complemented with discriminative rerankers, e.g., [1, 2]. The n-best parses generated by the parser for any input sentence (together with their probabilities) are reranked on the basis of rich feature sets and conditional probability estimates. The generative parser, essentially a joint probability over sentence-parse pairs defined by a generative grammar, is trained on treebanks like the Penn Wall Street Journal (WSJ) [3]. Usually, a corpus consists of language use concerning a range of topics. As observed by [4], subdomains like "politics, stock market, financial news etc. can be found" in the WSJ.
When a joint probability (over sentence-parse pairs) is trained on this treebank, the statistics gathered are averages over the different subdomains. By definition, averages smooth out the statistical differences between the individual subdomains and could make the generative model's task of initial ranking harder.

The present paper explores the question of whether there is any gain to be had from incorporating subdomain statistics in parse-reranking. It describes a new method for incorporating subdomain statistics into an existing state-of-the-art parser [1] in order to improve its n-best lists. The main idea is to exploit unannotated, subdomain-specific corpora gathered from the web for weighting the original treebank trees so that they reflect subdomain statistics, and to employ the resulting weighted treebanks for training individual subdomain-sensitive parsers. Our weighting method can be seen as an instance of "instance weighting", an idea that surfaced in the context of adaptation [5], but has not been instantiated or tested before within statistical parsing. In this paper we depart from a formidable parser (Charniak's) and exhibit how it may benefit from "subdomain instance-weighting" for composing its n-best lists.

In what follows we first discuss related work, then we describe the rationale behind instance weighting and define our weighting approach. Subsequently we outline our experimental setting and exhibit results with parsing-reranking the WSJ using the instance weighting technique. Finally we discuss some conclusions from this work.

2. Related Work

An early related study is [10]. Sekine analyzes the "domain dependence of parsing". In his experiments, a domain is characterized by the natural domains defined in the Brown corpus, for example 'Press Reportage', 'General Fiction' or 'Romance and Love Story'.
Sekine observes that in parsing, data from the same domain is the most advantageous, followed by data from the same class, while training on data from another domain generally performs worst. Sekine concludes that when trying "to parse a text in a particular domain, we should prepare a grammar which suits this domain" [10], thus suggesting a "domain-dependent parser".

Although different in flavour, work on domain adaptation is closely related to our work. While domain adaptation aims at adapting a parser from one domain to another, we aim here at finding the influence of specific unlabeled subdomain data on the performance of a "broad-coverage" parser. The role of subdomains and domains in a statistical classifier/model is not exactly the same: in the present work we try to produce specialized subdomain parsers in order to improve parser quality, as opposed to using raw data to migrate the statistics from one domain to another.

Recent research on adaptation is too numerous to discuss in detail in this paper. In particular, [5] suggest "instance weighting" as a method for adaptation. They examine their approach on three Natural Language Processing tasks: POS tagging, entity type classification and spam filtering. Our approach, subdomain instance weighting using raw data, can be seen as a novel version thereof for statistical parsing.

Theoretically speaking, successful domain adaptation hinges on some sense of "overlap" between the source and target domains, e.g., [6]. The overlap between source and target domains can be seen as a (mix of) subdomain(s) of both. Naturally, instance weighting, and its subdomain instantiation, can be seen as a weighted version of limited self-training, e.g., [2], which is again related to co-training [7, 8].

3. Data and Tools

All experiments were performed using the first-stage generative parser of Charniak [1].
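Before turning to the data, the general instance-weighting idea can be made concrete with a toy sketch. The snippet below is a minimal illustration, not the weighting formula used in this paper: it assumes a simple add-one-smoothed unigram language model built from a raw subdomain corpus, and weights each treebank sentence by the geometric mean of its word probabilities under that model, so that treebank trees resembling the subdomain receive higher training weight.

```python
# Hypothetical sketch of subdomain instance weighting: treebank
# sentences are scored against a unigram LM estimated from a raw
# (unannotated) subdomain corpus. Function names are illustrative.
from collections import Counter
import math


def unigram_model(raw_corpus):
    """Add-one-smoothed relative-frequency unigram model."""
    counts = Counter(w for sent in raw_corpus for w in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 for unseen words
    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab))
    return logprob


def subdomain_weights(treebank_sentences, raw_corpus):
    """Weight each treebank sentence by the geometric mean of its
    word probabilities under the subdomain model; higher weight
    means the sentence looks more subdomain-like."""
    lp = unigram_model(raw_corpus)
    weights = []
    for sent in treebank_sentences:
        avg_lp = sum(lp(w) for w in sent) / max(len(sent), 1)
        weights.append(math.exp(avg_lp))
    # Normalize so the weights average to 1: the effective size of
    # the weighted treebank stays that of the original treebank.
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]
```

A subdomain-specific parser would then be trained on the treebank with each tree's rule counts scaled by its sentence weight; one such weighted parser per subdomain yields the ensemble described above.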
We use the Penn Treebank (PT) Wall Street Journal (WSJ) [3], with the by now 'standard division' into training (sections 02-21) and development/dev (sec-