Available online at www.sciencedirect.com
Computer Speech and Language 27 (2013) 1085–1104
Bridge the gap between statistical and hand-crafted grammars
Ali Basirat, Heshaam Faili ∗
Laboratory of Natural Language and Text Processing, School of Electrical & Computer Engineering, College of Engineering,
University of Tehran, Tehran, Iran
Received 4 November 2011; received in revised form 19 September 2012; accepted 5 February 2013
Available online 27 February 2013
Abstract
Lexicalized tree-adjoining grammar (LTAG) is a rich formalism for performing NLP tasks such as semantic interpretation, parsing,
machine translation and information retrieval. Depending on the specific NLP task, different kinds of LTAGs may be developed for
a language. Each of these LTAGs is enriched with specific features, such as semantic representation or statistical information,
that make it suitable for that task. Because these capabilities are distributed among different LTAGs, it is difficult to benefit
from all of them in NLP applications.
This paper presents a statistical model that bridges two kinds of LTAGs for a natural language in order to benefit from the
capabilities of both. To do so, an HMM was trained that maps an elementary tree sequence of a source LTAG onto an elementary
tree sequence of a target LTAG. Training was performed with the standard HMM training algorithm, Baum–Welch. To
lead the training algorithm to a better solution, the initial state of the HMM was also trained by a novel EM-based semi-supervised
bootstrapping algorithm.
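As an illustration of the kind of mapping described above, the following minimal sketch decodes a target elementary tree sequence from a source elementary tree sequence with the Viterbi algorithm. It rests on several assumptions that are not taken from the paper: the target trees are treated as hidden states and the source trees as observations, the tree names (`x_det`, `m_det`, and so on) are hypothetical, and the probabilities are set by hand rather than estimated with Baum–Welch or the bootstrapping step.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence and its log-probability."""

    def log(p):
        # Guard against zero probabilities in sparse tables.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: best log-score of any path that ends in state s at position t.
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical target-grammar trees (hidden states); all numbers are toy values.
states = ["x_det", "x_noun", "x_verb"]
start_p = {"x_det": 0.6, "x_noun": 0.3, "x_verb": 0.1}
trans_p = {
    "x_det": {"x_det": 0.05, "x_noun": 0.9, "x_verb": 0.05},
    "x_noun": {"x_det": 0.1, "x_noun": 0.2, "x_verb": 0.7},
    "x_verb": {"x_det": 0.5, "x_noun": 0.4, "x_verb": 0.1},
}
# Probability that a target tree co-occurs with a hypothetical source-grammar tree.
emit_p = {
    "x_det": {"m_det": 0.9, "m_noun": 0.05, "m_verb": 0.05},
    "x_noun": {"m_det": 0.05, "m_noun": 0.85, "m_verb": 0.1},
    "x_verb": {"m_det": 0.05, "m_noun": 0.15, "m_verb": 0.8},
}

source_sequence = ["m_det", "m_noun", "m_verb", "m_det", "m_noun"]
path, score = viterbi(source_sequence, states, start_p, trans_p, emit_p)
print(path, score)
```

In a trained model, the three probability tables would come from the training procedure rather than being written by hand; only the decoding step is sketched here.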
The model was tested on two English LTAGs, XTAG (XTAG-Group, 2001) and MICA’s grammar (Bangalore et al., 2009), as the
target and source LTAGs, respectively. The empirical results confirm that the model provides a satisfactory way of linking these
LTAGs so that they can share their capabilities.
© 2013 Elsevier Ltd. All rights reserved.
Keywords: Tree adjoining grammar; LTAG; Hidden Markov model; XTAG; MICA
1. Introduction
Tree adjoining grammar (TAG), initially introduced by Joshi et al. (1975), is a tree-generating system
that forms the object language by a set of derived trees. As an extension of context-free grammars
(CFGs), this formalism belongs to the mildly context-sensitive grammars (MCSGs), a grammatical class that lies between
the context-free and context-sensitive grammars (Joshi, 1985).
In the lexicalized case, the elementary structures of the lexicalized tree-adjoining grammars (LTAGs) are assigned
to the lexical items of the language. These elementary structures are called elementary trees and the lexical items
assigned to them are called the anchors. Each elementary tree of an LTAG defines a syntactic environment in which its
anchor can appear (Bangalore and Joshi, 1999). There are two kinds of elementary trees: initial trees and auxiliary
This paper has been recommended for acceptance by E. Briscoe.
∗ Corresponding author. Tel.: +98 21 82089717; fax: +98 21 88633029.
E-mail addresses: a.basirat@srbiau.ac.ir (A. Basirat), hfaili@ut.ac.ir (H. Faili).
0885-2308/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.csl.2013.02.001