Building a large grammar for Italian

Alessandro Mazzei, Vincenzo Lombardo
Dipartimento di Informatica, Università di Torino
c. Svizzera 185, 10149 Torino, Italy
{mazzei, vincenzo}@di.unito.it

Abstract

We describe the construction of a large lexicalized tree adjoining grammar for Italian, automatically extracted from an annotated corpus. We first introduce the TUT, a dependency-style treebank for Italian, then we illustrate the algorithm that we have designed to extract the grammar, and finally we report two experiments on the parsing complexity and coverage of the extracted grammar.

1. Introduction

Building a wide-coverage grammar plays a key role in the realization of a language understanding system. The traditional methods for developing a wide-coverage grammar require a great deal of human effort (Black et al., 1993), but in recent years, with the advent of annotated corpora, the most immediate way to build wide-coverage grammars has been to extract them from treebanks. Two factors are of primary importance when extracting a grammar from a treebank: the type of annotation used in the treebank and the type of grammatical formalism.

The Turin University Treebank (TUT) is an ongoing project of the University of Turin on the construction of a dependency-style treebank for Italian (Bosco et al., 2000): each sentence is semi-automatically annotated with dependency relations that form a tree, and the relations are of morphological, syntactic and semantic types. The corpus is very varied and contains texts from newspapers, magazines, novels and press news. Its current size is 1500 annotated sentences (33,868 words), although in this work we report data on 1200 sentences. Figure 1 shows the annotation of the corpus sentence “La norma non ha mai trovato applicazione”. Each node in the tree contains a terminal word, a number that refers to its position in the linear order of the sentence, and the POS tag of the word.
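As a minimal sketch of the node content just described, the following Python fragment stores, for each word, its form, its position in the linear order, and its POS tag. The class name `DepNode`, its field names, and the helper functions are hypothetical and not part of the TUT format. POS tags are left unset because the tagset values are not reproduced here, and only one head link is filled in: the adverb “non” attached to the verb “trovato” with the relation ADVB-RMOD-NEG. All other heads are left empty rather than guessed.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DepNode:
    """One node of a dependency tree (hypothetical representation)."""
    word: str                       # terminal word
    position: int                   # 1-based position in the sentence
    pos_tag: Optional[str] = None   # POS tag; unset here (values not assumed)
    head: Optional[int] = None      # position of the head word, if known
    relation: Optional[str] = None  # label on the edge to the head, if known


def build_nodes(sentence: str) -> Dict[int, DepNode]:
    """Create one node per whitespace-separated word, keyed by position."""
    return {i: DepNode(word=w, position=i)
            for i, w in enumerate(sentence.split(), start=1)}


def linear_order(nodes: Dict[int, DepNode]) -> str:
    """Recover the sentence by reading the nodes in position order."""
    return " ".join(nodes[i].word for i in sorted(nodes))


nodes = build_nodes("La norma non ha mai trovato applicazione")

# The only edge stated in the text: "non" depends on "trovato".
nodes[3].head = 6
nodes[3].relation = "ADVB-RMOD-NEG"
```

Keying the nodes by position keeps the linear order recoverable independently of the tree shape, which is what the position number in each node is for.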
Each label on the edges of the tree represents a head-dependent relation. For instance, the relation ADVB-RMOD-NEG, which links the dependent adverb “non” with the head verb “trovato”, encodes the syntactic information that the adverb is a modifier of the verb.

Figure 1: Dependency tree with basic TUT annotation for the sentence “La norma non ha mai trovato applicazione”.

Several grammatical formalisms have been proposed with the aim of capturing the linguistic information present in treebanks. Lexicalized Tree Adjoining Grammar (LTAG) is a well-known grammar formalism that has interesting mathematical and linguistic properties (Joshi and Schabes, 1997) and has been applied to several tasks. An LTAG consists of elementary trees (instead of rules) that are combined through substitution and adjunction to form syntactic trees. Elementary trees can be initial (argumental) trees or auxiliary (modifier) trees. LTAG is a lexicalized formalism because each elementary tree has a terminal word on its frontier, called the anchor. The anchor defines the semantic content of the elementary tree: the elementary tree can be seen as an extended projection of the anchor. Wide-coverage LTAGs have been developed for several languages (English (Doran et al., 2000), French (Abeillé and Candito, 2000), German (Neumann, 2003)).

We present an algorithm to convert the dependency trees of the TUT to constituency trees, and then an algorithm to extract a lexicalized tree adjoining grammar from these constituency trees. To our knowledge, this is the first attempt to construct a wide-coverage LTAG for Italian.

2. From dependencies to constituents

In order to extract the LTAG grammar, we converted the TUT treebank from its dependency format to a constituency format, and then we adapted the algorithm in (Xia, 2001).
This algorithm was originally designed to build constituency trees close to the trees of the Penn Treebank (Marcus et al., 1993).

Given a level of the dependency tree with a Head and several Dependents, the conversion to constituency relies on three mappings: (i) the projection chain of the terminal category corresponding to the Head, i.e. the chain of non-terminal nodes projected by that terminal; (ii) the projection chains of the terminal categories corresponding to each Dependent; (iii) the attachment of each Dependent projection chain to the Head projection chain. These mappings depend on the POS tags of the nodes in the dependency tree, the argument or modifier role labelled on the edge, and the relative position of head and dependent (i.e. whether the dependent word is on the left or on the right of the head).

The algorithm that converts the dependency annotation into the constituency annotation has two stages. In the first stage it builds a binary constituency tree with unlabelled non-terminals. Starting from the root, it makes a