Kernelizing the Output of Tree-Based Methods

Pierre Geurts 1,2  pgeurts@ibisc.univ-evry.fr, p.geurts@ulg.ac.be
Louis Wehenkel 2  l.wehenkel@ulg.ac.be
Florence d'Alché-Buc 1  dalche@ibisc.univ-evry.fr

1 IBISC FRE CNRS 2873 & Epigenomics Project, GENOPOLE, 523, Places des Terrasses, 91 Evry, France
2 Department of EECS & CBIG/GIGA Centre of Genoproteomics, University of Liège, Sart Tilman, B28, B-4000 Liège, Belgium

Abstract

We extend tree-based methods to the prediction of structured outputs using a kernelization of the algorithm that allows one to grow trees as soon as a kernel can be defined on the output space. The resulting algorithm, called output kernel trees (OK3), generalizes classification and regression trees as well as tree-based ensemble methods in a principled way. It inherits several features of these methods such as interpretability, robustness to irrelevant variables, and input scalability. When only the Gram matrix over the outputs of the learning sample is given, it learns the output kernel as a function of the inputs. We show that the proposed algorithm works well on an image reconstruction task and on a biological network inference problem.

1. Introduction

Extending statistical learning to structured output spaces is now emerging as a new theoretical and practical challenge. In computational biology, natural language processing, and pattern recognition more generally, structured data such as strings, trees, and graphs abound and require new tools to be mined effectively. An elegant way to handle structured data is to use kernel methods, which encapsulate knowledge about the structure of a space of interest in the definition of a kernel. Already successful for structured inputs, original kernel-based methods have recently been proposed to address the structured output problem (Weston et al., 2002; Cortes et al., 2005; Tsochantaridis et al., 2005; Taskar et al., 2005; Weston et al., 2005).
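As an illustration of the kind of output kernel the approach relies on (the example is ours, not the paper's), consider outputs that are sets — say, the neighbor sets of nodes in a biological network. The Jaccard similarity is a simple positive semi-definite kernel on sets:

```python
def jaccard_kernel(y1: set, y2: set) -> float:
    """A minimal output kernel on sets: k(y1, y2) = |y1 & y2| / |y1 | y2|.

    Jaccard similarity is known to be positive semi-definite, so it is a
    valid kernel; we define k = 1 for two empty sets by convention.
    This is an illustrative choice, not the kernel used in the paper.
    """
    union = y1 | y2
    if not union:
        return 1.0
    return len(y1 & y2) / len(union)
```

Any such kernel induces a feature map phi on the output space, and it is only through scalar products k(y_i, y_j) = <phi(y_i), phi(y_j)> that the tree-growing algorithm below touches the outputs.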
Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).

In this paper, we start from a different family of models, namely tree-based models. Tree-based methods have proved useful when interpretability is required or when features need to be selected. When enhanced by ensemble methods, they achieve high performance and are considered a general and powerful tool whenever the attribute-vector representation is appropriate. In this context, we propose a straightforward extension of multiple-output regression trees using the kernel trick, exploiting the fact that the variance-based score function used for evaluating splits requires only scalar-product computations. Provided that a kernel has been defined on the output space, it thus becomes possible to build a tree that maps subregions of the input space, defined by axis-parallel hyperplanes, into hyperspheres defined in the feature space corresponding to the output kernel. The algorithm can also be used from a Gram matrix alone to learn an approximation of the underlying kernel as a function of the inputs. This new extension of trees, which we call Output Kernel Trees (OK3), can also benefit from ensemble methods because of the linearity of the ensemble combination operator.

Section 2 presents the kernelization of tree-based methods and its properties. Section 3 describes and comments on numerical experiments on two problems: a pattern completion task (Weston et al., 2002) and a graph inference problem (Vert & Yamanishi, 2004). Section 4 provides some perspectives.

2. Output kernel trees

The general problem of supervised learning may be formulated as follows: from a learning sample LS = {(x_i, y_i) | i = 1, ..., N_LS} with x_i ∈ X and y_i ∈ Y, find a function f : X → Y that minimizes the expectation of some loss function over the joint distribution of input/output pairs:

E_{x,y} {ℓ(f(x), y)}.    (1)
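The observation that the variance-based split score needs only scalar products can be sketched as follows. Assuming an N x N output Gram matrix K with K[i, j] = k(y_i, y_j), the empirical variance of the implicit feature vectors phi(y_i) in a tree node, and the resulting variance-reduction score of a candidate split, can be computed without ever materializing the feature map. The function names and interfaces here are illustrative, not the paper's:

```python
import numpy as np

def node_variance(K: np.ndarray, idx: np.ndarray) -> float:
    """Empirical variance of phi(y) over the samples in `idx`, from the
    output Gram matrix alone:
        (1/n) * sum_i K[i, i]  -  (1/n^2) * sum_{i,j} K[i, j]
    i.e. mean squared norm minus squared norm of the mean in feature space.
    """
    n = len(idx)
    if n == 0:
        return 0.0
    sub = K[np.ix_(idx, idx)]
    return np.trace(sub) / n - sub.sum() / n**2

def split_score(K: np.ndarray, idx: np.ndarray, left_mask: np.ndarray) -> float:
    """Variance reduction achieved by splitting the node `idx` into the
    subsets selected by `left_mask` and its complement."""
    left, right = idx[left_mask], idx[~left_mask]
    n = len(idx)
    return (node_variance(K, idx)
            - len(left) / n * node_variance(K, left)
            - len(right) / n * node_variance(K, right))
```

When K = Y @ Y.T for explicit output vectors Y, `node_variance` coincides with the ordinary multi-output variance used by regression trees, and by the variance decomposition the split score is always nonnegative; with a structured-output kernel, the same code scores splits in the induced feature space.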