Robust Pre-Processing and Semi-Supervised Lexical Bridging for User-Generated Content Parsing

Djamé Seddah 1,2   Benoît Sagot 2   Marie Candito 2
1. ISHA, Université Paris–Sorbonne, Paris, France
2. Alpage, INRIA & Université Paris–Diderot, Paris, France
djame.seddah@paris-sorbonne.fr, benoit.sagot@inria.fr, marie.candito@linguist.jussieu.fr

Abstract

We describe the architecture we set up during the SANCL shared task for parsing user-generated texts, which deviate in various ways from the linguistic conventions used in available training treebanks. This architecture focuses on coping with such divergence. It relies on the PCFG-LA framework (Petrov and Klein, 2007), as implemented by Attia et al. (2010). We explore several techniques to improve robustness: (i) a lexical bridge technique (Candito et al., 2011) that uses unsupervised word clustering (Koo et al., 2008); (ii) a special instantiation of self-training aimed at coping with POS tags unknown to the training set; (iii) the wrapping of a POS tagger with rule-based processing for dealing with recurrent non-standard tokens; and (iv) the guiding of out-of-domain parsing with predicted part-of-speech tags for unknown words and unknown (word, tag) pairs. Our systems ranked second and third out of eight in the constituency parsing track of the SANCL competition.

1 Introduction

Complaining about the lack of robustness of statistical parsers whenever they are applied to out-of-domain text has almost become an overused cliché over the last few years. It remains true that such parsers only perform well on texts that are comparable to their training corpus, especially in terms of genre. As noted by Foster (2010) and Foster et al.
(2011), most studies of out-of-domain statistical parsing have focused mainly on slightly different newswire texts (Gildea, 2001; McClosky et al., 2006b; McClosky et al., 2006a), biomedical data (Lease and Charniak, 2005; McClosky and Charniak, 2008) or balanced corpora mixing different genres (Foster et al., 2007). The common point between these corpora is that they are edited texts, meaning that their underlying syntax, spelling, tokenization and typography remain standard, even if they diverge slightly from the newswire genre. 1 Standard NLP tools can therefore be used on such corpora.

New forms of electronic communication have emerged in the last few years, namely social media and Web 2.0 communication media, either synchronous (micro-blogging) or asynchronous (forums); the need for comprehensive ways of coping with the new language types carried by those media is thus becoming of crucial importance. If those unlimited streams of texts were all written with the same level of proficiency as our canonical treebanks, the problem, as it should be called, would simply be 2 a matter of domain adaptation. However, as shown by the challenges experienced during the Parsing the Web shared task (Petrov and McDonald, 2012), this is far from being the case. In this paper, we describe the architecture we set up in order to cope with prevalent issues in user-generated content, such as lexical data sparseness and noisy input. This architecture relies on the PCFG-LA framework (Petrov and Klein, 2007), as implemented by Attia et al. (2010).
We explore several techniques to improve robustness: (i) a lexical bridge technique (Candito et al., 2011) that uses unsupervised word clustering (Koo et al., 2008); (ii) a special instantiation of self-training aimed at coping with POS tags unknown to the training set; (iii) the wrapping of a POS tagger with rule-based processing for dealing with recurrent non-standard tokens; and (iv) the guiding of out-of-domain parsing with predicted part-of-speech tags for unknown words and unknown (word, tag) pairs.

1 Putting aside speech specificities, present in some known data sets such as the BNC (Leech, 1992).
2 Cf. McClosky et al. (2010) for numerous pieces of evidence of the non-triviality of that task.

hal-00703124, version 1 - 31 May 2012. Author manuscript, published in "First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL), an NAACL-HLT'12 workshop (2012)".
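The intuition behind the lexical bridge of technique (i) can be sketched as follows: word forms in both the training treebank and the target user-generated text are mapped to unsupervised cluster IDs, so that an out-of-domain spelling unseen in the treebank shares a symbol with distributionally similar in-domain words. The toy cluster table and function names below are hypothetical illustrations, not the actual system, which follows Candito et al. (2011) and uses Brown-style clusters as in Koo et al. (2008):

```python
# Minimal sketch of a "lexical bridge": replace word forms with
# unsupervised cluster IDs so that noisy, out-of-domain tokens and
# canonical treebank tokens collapse to a shared representation.
# The cluster table is a toy stand-in for real Brown cluster output
# (bit-string prefixes); in practice it is induced from large raw corpora.

BROWN_CLUSTERS = {  # hypothetical cluster IDs
    "good": "0110", "great": "0110", "gr8": "0110",
    "movie": "1010", "film": "1010",
}

def bridge_token(word, clusters, unk="UNK"):
    """Map a word form to its cluster ID; fall back to a generic symbol."""
    return clusters.get(word.lower(), unk)

def bridge_sentence(tokens, clusters=BROWN_CLUSTERS):
    return [bridge_token(t, clusters) for t in tokens]

# A treebank sentence and a noisy web sentence map to the same symbols:
print(bridge_sentence(["great", "movie"]))  # ['0110', '1010']
print(bridge_sentence(["gr8", "film"]))     # ['0110', '1010']
```

Because the parser is then trained and run over cluster IDs rather than raw forms, "gr8" is no longer an unknown word at parsing time: it inherits the distributional behavior of "great".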