Formulating Distance Functions via the Kernel Trick

Gang Wu, Electrical Engineering, University of California, Santa Barbara, CA (gwu@ece.ucsb.edu)
Edward Y. Chang, Electrical Engineering, University of California, Santa Barbara, CA (echang@ece.ucsb.edu)
Navneet Panda, Computer Science, University of California, Santa Barbara, CA (panda@cs.ucsb.edu)

ABSTRACT
Tasks of data mining and information retrieval depend on a good distance function for measuring similarity between data instances. The most effective distance function must be formulated in a context-dependent (also application-, data-, and user-dependent) way. In this paper, we propose to learn a distance function by capturing the nonlinear relationships among contextual information provided by the application, data, or user. We show that through a process called the "kernel trick," such nonlinear relationships can be learned efficiently in a projected space. Theoretically, we substantiate that our method is both sound and optimal. Empirically, using several datasets and applications, we demonstrate that our method is effective and useful.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms

Keywords
Distance function, kernel trick

1. INTRODUCTION
At the heart of data-mining and information-retrieval tasks is a distance function that measures similarity between data instances. To date, most applications employ a variant of the Euclidean distance for measuring similarity. However, to measure similarity meaningfully, an effective distance function ought to consider the idiosyncrasies of the application, data, and user (hereafter we refer to these factors as contextual information). The quality of the distance function significantly affects the success in organizing data or finding meaningful results [1, 2, 5, 9, 11]. How do we consider contextual information in formulating a good distance function?
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'05, August 21–24, 2005, Chicago, Illinois, USA. Copyright 2005 ACM 1-59593-135-X/05/0008 ...$5.00.

One extension of the popular Euclidean distance (or more generally, the Lp-norm) is to weight the data attributes (features) based on their importance for a target task [2, 9, 18]. For example, for answering an ocean image-query, color features should be weighted higher. For answering an architecture image-query, shape and texture features may be more important. Weighting these features is equivalent to performing a linear transformation in the space formed by the features. Although linear models enjoy the twin advantages of simplicity of description and efficiency of computation, this same simplicity is insufficient to model similarity for many real-world datasets. For example, it has been widely acknowledged in the image/video retrieval domain that a query concept is typically a nonlinear combination of perceptual features (color, texture, and shape) [16]. In this paper we propose performing a nonlinear transformation on the feature space to gain greater flexibility for mapping features to semantics. We name our method distance-function alignment (DAlign for short). The inputs to DAlign are a prior distance function and contextual information. Contextual information can be conveyed in the form of training data (discussed in detail in Section 2). For instance, in the information-retrieval domain, Web users can convey information via relevance feedback showing which documents are relevant to their queries.
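The equivalence between feature weighting and a linear transformation can be illustrated with a small sketch. The feature vectors and weights below are hypothetical, not from the paper; the point is only that a weighted Euclidean distance equals the ordinary Euclidean distance after applying the diagonal transformation A = diag(√w):

```python
import numpy as np

# Hypothetical 3-D feature vectors, e.g. (color, texture, shape) descriptors.
x = np.array([0.8, 0.3, 0.1])
y = np.array([0.6, 0.9, 0.2])

# Task-dependent weights, e.g. emphasizing color for an ocean image-query.
w = np.array([2.0, 0.5, 0.5])

# Weighted Euclidean distance: sqrt(sum_k w_k (x_k - y_k)^2).
d_weighted = np.sqrt(np.sum(w * (x - y) ** 2))

# Equivalent view: transform the space by A = diag(sqrt(w)), then take
# the plain Euclidean distance in the transformed space.
A = np.diag(np.sqrt(w))
d_linear = np.linalg.norm(A @ x - A @ y)

assert np.isclose(d_weighted, d_linear)
```

Because A is a fixed linear map, every query concept expressible this way is a (hyper-)ellipsoidal region in the original feature space, which is the limitation the nonlinear transformation proposed here addresses.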
In the biomedical domain, physicians can indicate which pairs of proteins may have similar functions. DAlign transforms the prior function to capture the nonlinear relationships among the contextual information. The similarity scores of unseen data-pairs can then be measured by the transformed function to better reflect the idiosyncrasies of the application, data, and user.

At first it might seem that capturing nonlinear relationships among contextual information can suffer from high computational complexity. DAlign avoids this concern by employing the kernel trick [3]. The kernel trick lets us generalize distance-based algorithms to operate in the projected space (defined next), usually nonlinearly related to the input space. The input space (denoted as I) is the original space in which data vectors are located (e.g., in Figure 1(a)), and the projected space (denoted as P) is the space to which the data vectors are projected, linearly or nonlinearly (e.g., in Figure 1(b)). The advantage of using the kernel trick is that, instead of explicitly determining the coordinates of the data vectors in the projected space, the distance computation in P can be efficiently performed in I through a kernel function. Specifically, given two vectors xi and xj, the kernel function K(xi, xj) is defined as the inner product of φ(xi) and φ(xj), where φ is a basis function that maps the vectors xi and xj from I to P. The inner product between two vectors can be thought of as a measure of their similarity. Therefore, K(xi, xj) returns the similarity of xi and xj in P. The distance between xi and xj in terms of the kernel is defined as

    d(xi, xj) = ‖φ(xi) − φ(xj)‖ = √( K(xi, xi) + K(xj, xj) − 2 K(xi, xj) ).    (1)
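Equation (1) can be checked numerically for a kernel whose basis function φ is small enough to write out. The sketch below uses the degree-2 homogeneous polynomial kernel K(a, b) = (a · b)², whose explicit map for 2-D inputs is φ(v) = (v1², v2², √2·v1·v2); the kernel choice and the sample vectors are illustrative, not taken from the paper:

```python
import numpy as np

def poly_kernel(a, b):
    # Degree-2 homogeneous polynomial kernel: K(a, b) = (a . b)^2.
    return np.dot(a, b) ** 2

def phi(v):
    # Explicit basis function for this kernel in 2-D:
    # phi(v) = (v1^2, v2^2, sqrt(2) v1 v2), so <phi(a), phi(b)> = (a . b)^2.
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.5])

# Distance in the projected space P, computed by materializing phi.
d_explicit = np.linalg.norm(phi(xi) - phi(xj))

# The same distance via the kernel trick, Eq. (1), computed entirely in I.
d_kernel = np.sqrt(poly_kernel(xi, xi) + poly_kernel(xj, xj)
                   - 2 * poly_kernel(xi, xj))

assert np.isclose(d_explicit, d_kernel)
```

The two quantities agree, and the kernel-side computation never forms the coordinates in P; for kernels such as the Gaussian RBF, whose projected space is infinite-dimensional, the kernel side is the only practical option.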