Pattern Recognition Letters 104 (2018) 8–14

Exploiting covariate embeddings for classification using Gaussian processes

Daniel Andrade a,*, Akihiro Tamura b, Masaaki Tsuchida c

a Security Research Laboratories, NEC Corporation, Japan
b Graduate School of Science and Engineering, Ehime University, Japan
c DeNA Co., Ltd., Japan

Article history: Received 16 May 2017; Available online 16 January 2018.

Keywords: Logistic regression; Auxiliary information of covariates; Gaussian process; Text classification

Abstract: In many logistic regression tasks, auxiliary information about the covariates is available. For example, a user might be able to specify a similarity measure between the covariates, or an embedding (feature vector) for each covariate, created from unlabeled data. In particular for text classification, the covariates (words) can be described by word embeddings or by similarity measures from lexical resources like WordNet. We propose a new method to use such embeddings of covariates for logistic regression. Our method consists of two main components. The first component is a Gaussian process (GP) with a covariance function that models the correlations between covariates and returns a noise-free estimate of the covariates. The second component is a logistic regression model that uses these noise-free estimates. One advantage of our model is that the covariance function can be adjusted to the training data using maximum likelihood. Another advantage is that new covariates that never occurred in the training data can be incorporated at test time, while run-time increases only linearly in the number of new covariates. Our experiments demonstrate the usefulness of our method in situations where only a small amount of training data is available.

© 2018 Published by Elsevier B.V.

1.
Introduction

Classification is ubiquitous in many applications of machine learning and statistics. However, with small training data, classification performance is often insufficient, and, as a consequence, several types of additional knowledge are incorporated: unlabeled data using semi-supervised learning techniques [1], assumptions about the generation process of the data [2], auxiliary information about samples [3], and auxiliary information about covariates [4].

In this work, we focus on incorporating auxiliary information about covariates that is given in the form of similarity information or embeddings. For text classification, where covariates are single words, covariate embeddings can be easily acquired from unlabeled documents using, for instance, word2vec [5] or GloVe [6]. Alternatively, similarities between covariates can be manually defined, and are available in resources like WordNet [7]. In the latter case, covariate embeddings can be easily learned using spectral decomposition of similarity matrices (see Supplementary Material, Section 2).

* Corresponding author. E-mail address: s-andrade@cj.jp.nec.com (D. Andrade).

In order to incorporate this knowledge about the covariates into logistic regression, we propose to model the interaction of the covariates by a Gaussian process (GP). The use of a Gaussian process allows us to directly model the joint covariate distribution by an appropriate covariance function that depends on the covariate embeddings. Our model assumes that the true (unknown) values of the covariates are generated from a GP, and that the observed values are subject to additive noise. By recovering the true covariate values, our model is also able to adjust the values of related covariates that are not observed in the sample. In particular, for text classification, our method finds positive weights for semantically related words that do not explicitly occur in the document.
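The spectral construction of embeddings from a similarity matrix mentioned above can be illustrated with a short sketch. The paper's exact procedure is in the Supplementary Material (Section 2) and is not reproduced here; the function below is only the standard approach of eigendecomposing the (symmetric) similarity matrix and scaling the top eigenvectors by the square roots of their eigenvalues, so that inner products of embeddings approximate the similarities. The function name and the toy matrix are illustrative.

```python
import numpy as np

def embeddings_from_similarity(S, k):
    """Derive k-dimensional covariate embeddings from a symmetric
    similarity matrix S via spectral decomposition.
    Rows of the returned matrix are the embeddings."""
    # Eigendecomposition of the symmetric similarity matrix.
    eigvals, eigvecs = np.linalg.eigh(S)
    # Keep the k largest eigenvalues (clipped at zero, in case S
    # is not positive semi-definite).
    idx = np.argsort(eigvals)[::-1][:k]
    lam = np.clip(eigvals[idx], 0.0, None)
    # Scale eigenvectors so that inner products of rows approximate S.
    return eigvecs[:, idx] * np.sqrt(lam)

# Toy similarity matrix over 3 covariates (e.g., words).
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
E = embeddings_from_similarity(S, k=2)
# E @ E.T is a rank-2 approximation of S.
```

The rank k trades off fidelity to the similarity matrix against the dimensionality of the resulting embedding space.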
Our proposed method effectively performs a kind of smoothing of the covariate vector that is controlled by the parameters of the covariance function and the noise variance.

Previous work using such covariate information mainly concentrates on designing ontology-specific kernels [4,8] or semantic smoothing kernels from unlabeled data that cannot be adjusted to the labeled training data at hand [9,10]. Wittek and Tan [11] propose a wavelet kernel that can incorporate distance information between covariates. However, their method requires creating a

https://doi.org/10.1016/j.patrec.2018.01.011
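The smoothing behavior described above can be made concrete with a small sketch. It uses the standard GP regression posterior mean, x̂ = K(K + σ²I)⁻¹x, where K is a covariance matrix built from the covariate embeddings; the paper's full model and its maximum-likelihood fitting of the covariance parameters are not reproduced here, and the RBF kernel choice, γ, and noise variance below are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(E, gamma=1.0):
    """RBF covariance between covariates, computed from their
    embeddings E (one embedding per row)."""
    sq = np.sum(E**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * E @ E.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def smooth_covariates(x, K, noise_var=0.1):
    """GP posterior-mean estimate of the noise-free covariate values:
    x_hat = K (K + sigma^2 I)^{-1} x."""
    n = K.shape[0]
    return K @ np.linalg.solve(K + noise_var * np.eye(n), x)

# Toy example: 3 words; words 0 and 1 have similar embeddings,
# word 2 is unrelated.
E = np.array([[0.0, 0.1],
              [0.1, 0.0],
              [5.0, 5.0]])
K = rbf_kernel(E, gamma=1.0)
x = np.array([1.0, 0.0, 0.0])   # only word 0 occurs in the document
x_hat = smooth_covariates(x, K)
# x_hat[1] > 0: the semantically related word 1 receives positive
# weight even though it never occurs in the document, while the
# unrelated word 2 stays near zero.
```

The smoothed vector x_hat would then serve as input to the logistic regression component, mirroring the behavior described in the introduction.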