Intelligenza Artificiale 11 (2017) 93–116
DOI 10.3233/IA-170109
IOS Press
Effective and scalable kernel-based language learning via stratified Nyström methods
Danilo Croce∗, Simone Filice and Roberto Basili
University of Roma, Tor Vergata, Via del Politecnico 1, Roma, Italy
Abstract. Expressive but complex kernel functions, such as Sequence or Tree kernels, are usually underemployed in NLP tasks due to their significant computational cost in both the learning and classification stages. Recently, the Nyström methodology for data embedding has been proposed as a viable solution to scalability problems. It improves the scalability of learning processes acting over highly structured data by mapping data into low-dimensional, compact linear representations of kernel spaces. In this paper, a stratification of the model corresponding to the embedding space is proposed as a further, highly flexible optimization. Nyström embedding spaces of increasing sizes are combined in an efficient ensemble strategy: upper layers, providing higher-dimensional representations, are invoked on input instances only when the adoption of smaller (i.e., less expressive) embeddings provides uncertain outcomes. Experimental results using different models of such uncertainty show that state-of-the-art accuracy on three semantic inference tasks can be obtained even when one order of magnitude fewer kernel computations are carried out.
Keywords: Nyström method, scalability, kernel methods, structured language learning
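The two ideas summarized in the abstract can be sketched in a few lines of NumPy. The following is a minimal illustration under an RBF kernel, with hypothetical function names, not the authors' implementation: `nystrom_embed` maps data into a low-dimensional linear space whose dot products approximate the kernel, and `stratified_predict` escalates to a larger (costlier) layer only when a cheaper one is uncertain.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Gaussian (RBF) kernel between all pairs of rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_embed(X, landmarks, kernel, eps=1e-12):
    # Nystrom method: embed each example so that linear dot products
    # approximate the original kernel function.
    K_ll = kernel(landmarks, landmarks)     # m x m landmark Gram matrix
    K_nl = kernel(X, landmarks)             # n x m kernel evaluations
    w, U = np.linalg.eigh(K_ll)             # symmetric eigendecomposition
    w = np.maximum(w, eps)                  # guard tiny/negative eigenvalues
    return K_nl @ U @ np.diag(1.0 / np.sqrt(w))

def stratified_predict(x, layers, threshold=0.5):
    # Cheap, low-dimensional layers are tried first; the next (larger,
    # costlier) layer is invoked only when the decision value is uncertain.
    for clf in layers[:-1]:
        score = clf(x)
        if abs(score) >= threshold:
            return np.sign(score)
    return np.sign(layers[-1](x))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
landmarks = X[:10]                          # in practice: sampled from data
Phi = nystrom_embed(X, landmarks, rbf_kernel)
K_approx = Phi @ Phi.T                      # linear dot products ~ kernel

# Toy cascade: an uncertain small-embedding score defers to a larger layer.
small = lambda x: 0.1                       # |score| < threshold: uncertain
large = lambda x: -2.0                      # confident decision value
label = stratified_predict(None, [small, large])
```

On the landmark points themselves the Nyström reconstruction is exact, which is why the landmark block of `K_approx` matches the true Gram matrix up to numerical precision.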
1. Introduction
Statistical learning methods have proven successful in several Natural Language Processing and Web retrieval tasks. In particular, kernel-based learning ([45, 49]) has been widely applied in language processing to alleviate the need for complex feature engineering (e.g., [48]). While ad-hoc features have inspired successful approaches to language learning, such as early works on Bayesian learning for semantic role labeling ([25]), kernels provide a
natural way to capture textual generalizations directly
operating over (possibly complex) linguistic struc-
tures. Sequence [6] or Tree kernels [9] are particularly
interesting as the feature spaces they implicitly generate reflect linguistic patterns (n-grams or parse tree fragments) that correspond to very expressive constraints for the learning process.

∗Corresponding author: Danilo Croce, University of Roma, Tor Vergata, Via del Politecnico 1, Roma, Italy. E-mail: croce@info.uniroma2.it.

Moreover, proper
generalizations of lexical information (which is crucial in many language processing tasks) have been injected into the above kernels, as in [13], where vector models of lexical semantics are combined with
grammatical knowledge from the involved trees: for example, these kernels achieved state-of-the-art performance on question classification tasks [1].
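To make the idea of an implicit feature space concrete, the following sketch (illustrative only, not the actual kernels of [6, 9]) computes a dot product in the space indexed by word bigrams, so that two sentences score highly when they share many contiguous word pairs:

```python
from collections import Counter

def ngrams(tokens, n=2):
    # Multiset of contiguous word n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_kernel(s1, s2, n=2):
    # Dot product in the implicit space indexed by word n-grams:
    # each shared n-gram contributes the product of its counts.
    g1, g2 = ngrams(s1.split(), n), ngrams(s2.split(), n)
    return sum(c * g2[g] for g, c in g1.items())

k = ngram_kernel("what is the capital of France",
                 "what is the capital of Italy")  # 4 shared bigrams
```

Tree kernels generalize the same scheme by counting shared parse-tree fragments rather than shared n-grams, and Sequence kernels additionally allow gapped matches; in both cases the feature space is never materialized explicitly.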
The modeling provided by kernels allows the learn-
ing algorithm to closely depend on the data semantics.
In statistical learning for natural language processing, for example, tree kernels exploit structured representations of the input data, i.e., parse trees, able to capture complex grammatical and lexical constraints. Expressive, and mostly data-driven, metrics are thus induced through kernel functions, while large annotated data sets naturally provide meaningful topologies (in the implicit feature spaces) for a learning algorithm (e.g., a Support Vector Machine [49]). The
1724-8035/17/$35.00 © 2017 – IOS Press and the authors. All rights reserved