Intelligenza Artificiale 11 (2017) 93–116, DOI 10.3233/IA-170109, IOS Press

Effective and scalable kernel-based language learning via stratified Nyström methods

Danilo Croce, Simone Filice and Roberto Basili
University of Roma, Tor Vergata, Via del Politecnico 1, Roma, Italy

Abstract. Expressive but complex kernel functions, such as Sequence or Tree Kernels, are usually underemployed in NLP tasks because of their significant computational cost in both the learning and classification stages. Recently, the Nyström methodology for data embedding has been proposed as a viable solution to scalability problems. It improves the scalability of learning processes acting over highly structured data by mapping data into low-dimensional, compact linear representations of kernel spaces. In this paper, a stratification of the model corresponding to the embedding space is proposed as a further, highly flexible optimization. Nyström embedding spaces of increasing sizes are combined in an efficient ensemble strategy: upper layers, providing higher-dimensional representations, are invoked on input instances only when the adoption of smaller (i.e., less expressive) embeddings yields uncertain outcomes. Experimental results using different models of such uncertainty show that state-of-the-art accuracy on three semantic inference tasks can be obtained even when one order of magnitude fewer kernel computations are carried out.

Keywords: Nyström method, scalability, kernel methods, structured language learning

1. Introduction

Statistical learning methods have proven successful in several Natural Language Processing and Web retrieval tasks. In particular, kernel-based learning ([45, 49]) has been largely applied in language processing to alleviate the need for complex feature engineering (e.g., [48]).
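The Nyström embedding at the core of the paper can be sketched in a few lines: a small set of landmark instances is used to build a low-dimensional linear map whose dot products approximate the original kernel. The snippet below is a minimal illustration over a standard RBF kernel on vectors (the paper instead applies the technique to sequence and tree kernels over linguistic structures); the function names and the choice of landmark sampling are ours, not the authors'.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gaussian (RBF) kernel between the rows of X and the rows of Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_embed(X, landmarks, gamma=1.0, eps=1e-12):
    """Map X into a low-dimensional linear space whose inner products
    approximate the kernel, via the Nystrom method."""
    K_mm = rbf_kernel(landmarks, landmarks, gamma)   # m x m landmark Gram matrix
    K_nm = rbf_kernel(X, landmarks, gamma)           # n x m cross-kernel
    # Eigendecompose the landmark Gram matrix and build the projection
    w, V = np.linalg.eigh(K_mm)
    w = np.maximum(w, eps)                           # clamp tiny eigenvalues for stability
    proj = V / np.sqrt(w)                            # V diag(w)^{-1/2}
    return K_nm @ proj                               # n x m embedding

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
L = X[rng.choice(200, size=20, replace=False)]       # 20 random landmarks
E = nystrom_embed(X, L)                              # 200 x 20 linear representation
# E @ E.T now approximates rbf_kernel(X, X), at a fraction of the cost
```

After the embedding, any linear learner (e.g., a linear SVM) can be trained on `E`, which is what makes the approach scale: the expensive kernel is evaluated only against the m landmarks rather than against all training instances.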
While ad-hoc features have inspired successful approaches to language learning, such as the early works on Bayesian learning for semantic role labeling ([25]), kernels provide a natural way to capture textual generalizations by directly operating over (possibly complex) linguistic structures. Sequence [6] or Tree Kernels [9] are particularly interesting, as the feature spaces they implicitly generate reflect linguistic patterns (n-grams or parse tree fragments) that correspond to very expressive constraints for the learning process. Moreover, proper generalizations of lexical information (which is crucial in many language processing tasks) have been injected into the above kernels, as in ([13]), where vector models of lexical semantics are combined with grammatical knowledge from the involved trees: for example, these kernels achieved state-of-the-art performance on question classification tasks [1].

The modeling provided by kernels allows the learning algorithm to closely depend on the data semantics. In statistical learning for natural language processing, for example, tree kernels exploit structured representations of the input data, i.e., parse trees, able to capture complex grammatical and lexical constraints. Expressive, and mostly data-driven, metrics are thus induced through kernel functions, whereas large annotated data sets naturally provide meaningful topologies (in the implicit feature spaces) for a learning algorithm (e.g., Support Vector Machine [49]).

Corresponding author: Danilo Croce, University of Roma, Tor Vergata, Via del Politecnico 1, Roma, Italy. E-mail: croce@info.uniroma2.it.

1724-8035/17/$35.00 © 2017 – IOS Press and the authors. All rights reserved.
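To make the idea of an implicit n-gram feature space concrete, the toy kernel below counts shared word n-grams between two token sequences. This is a simplified p-spectrum kernel of our own devising for illustration, not the gap-weighted sequence kernel of [6]: the point is that the similarity is computed over linguistic patterns without ever materializing the (very large) explicit feature vector.

```python
from collections import Counter

def ngram_profile(tokens, n=2):
    # Implicit feature vector: counts of each contiguous word n-gram
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_kernel(a, b, n=2):
    # Dot product in the n-gram feature space, via the shared entries only
    pa, pb = ngram_profile(a, n), ngram_profile(b, n)
    return sum(count * pb[g] for g, count in pa.items())

s1 = "the cat sat on the mat".split()
s2 = "the cat lay on the mat".split()
k = ngram_kernel(s1, s2)  # counts the bigrams the two sentences share
```

A Gram matrix built from such a function can be handed to any kernelized learner; the tree kernels cited above follow the same scheme, with parse tree fragments in place of n-grams.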