Pattern Recognition Letters 104 (2018) 8–14
Exploiting covariate embeddings for classification using Gaussian processes

Daniel Andrade a,∗, Akihiro Tamura b, Masaaki Tsuchida c

a Security Research Laboratories, NEC Corporation, Japan
b Graduate School of Science and Engineering, Ehime University, Japan
c DeNA Co., Ltd., Japan
Article info
Article history:
Received 16 May 2017
Available online 16 January 2018
Keywords:
Logistic regression
Auxiliary information of covariates
Gaussian process
Text classification
Abstract
In many logistic regression tasks, auxiliary information about the covariates is available. For example, a
user might be able to specify a similarity measure between the covariates, or an embedding (feature
vector) for each covariate, which is created from unlabeled data. In particular for text classification, the
covariates (words) can be described by word embeddings or similarity measures from lexical resources
like WordNet. We propose a new method to use such embeddings of covariates for logistic regression.
Our method consists of two main components. The first component is a Gaussian process (GP) with a
covariance function that models the correlations between covariates, and returns a noise-free estimate of
the covariates. The second component is a logistic regression model that uses these noise-free estimates.
One advantage of our model is that the covariance function can be adjusted to the training data using
maximum likelihood. Another advantage is that new covariates that never occurred in the training data
can be incorporated at test time, while run-time increases only linearly in the number of new covariates.
Our experiments demonstrate the usefulness of our method in situations where only a small amount of training data is available.
© 2018 Published by Elsevier B.V.
1. Introduction
Classification is ubiquitous in many applications in machine learning and statistics. However, with small training data, classification performance is often insufficient; as a consequence, several types of additional knowledge are commonly incorporated:

• unlabeled data using semi-supervised learning techniques [1],
• assumptions about the generation process of the data [2],
• auxiliary information about samples [3],
• auxiliary information about covariates [4].
In this work, we focus on incorporating auxiliary information about covariates, given in the form of similarity information or embeddings. For text classification, where covariates are single words, covariate embeddings can easily be acquired from unlabeled documents using, for instance, word2vec [5] or GloVe [6]. Alternatively, similarities between covariates can be manually
defined, and are available in resources like WordNet [7]. In the latter case, covariate embeddings can be easily learned using spectral decomposition of similarity matrices (see Supplementary Material, Section 2).

∗ Corresponding author. E-mail address: s-andrade@cj.jp.nec.com (D. Andrade).
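As a concrete illustration of this spectral construction (a minimal NumPy sketch; the function name and the choice to truncate negative eigenvalues are our own assumptions, not details taken from the paper or its supplement):

```python
import numpy as np

def spectral_embeddings(S, d):
    """Derive d-dimensional covariate embeddings from a symmetric
    similarity matrix S via eigendecomposition, so that the inner
    product of two embeddings approximates the given similarity."""
    S = (S + S.T) / 2                      # symmetrize for numerical safety
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues ascending
    idx = np.argsort(eigvals)[::-1][:d]    # keep the d largest eigenvalues
    lam = np.clip(eigvals[idx], 0.0, None) # drop negative parts of an indefinite S
    return eigvecs[:, idx] * np.sqrt(lam)  # rows are covariate embeddings
```

For a positive semi-definite similarity matrix of rank at most d, the returned embeddings E satisfy E E^T ≈ S.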
To incorporate this covariate knowledge into logistic regression, we propose to model the interaction of the covariates by a Gaussian process (GP). The GP allows us to directly model the joint covariate distribution by an appropriate covariance function that depends on the covariate embeddings. Our model assumes that the true (unknown) values of the covariates are generated from a GP, and that the observed values are corrupted by additive noise. By recovering the true covariate values, our model can also adjust the values of related covariates that are not observed in the sample. In particular, for text classification, our method finds positive weights for semantically related words that do not explicitly occur in the document. Our proposed method effectively performs a kind of smoothing of the covariate vector, controlled by the parameters of the covariance function and the noise variance.
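The smoothing step can be sketched as ordinary GP regression with the observed covariate vector as noisy targets: the posterior mean K (K + σ²I)⁻¹ x shifts weight onto covariates whose embeddings are close to observed ones. The RBF covariance and all names below are our illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def rbf_kernel(E, length_scale=1.0):
    """RBF covariance over covariate embeddings:
    K[i, j] = exp(-||e_i - e_j||^2 / (2 * length_scale^2))."""
    sq = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * length_scale ** 2))

def gp_smooth(x, E, noise_var=0.1, length_scale=1.0):
    """Posterior mean of the noise-free covariate values under a GP
    prior with covariance K, given the observed vector x = true + noise."""
    K = rbf_kernel(E, length_scale)
    n = K.shape[0]
    return K @ np.linalg.solve(K + noise_var * np.eye(n), x)
```

For a word observed in a document, the smoothed vector assigns positive weight to words with nearby embeddings even when their observed count is zero, which matches the behavior described above.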
Previous work using such covariate information mainly concentrates on designing ontology-specific kernels [4,8] or semantic smoothing kernels from unlabeled data that cannot be adjusted to the labeled training data at hand [9,10]. Wittek and Tan [11] propose a wavelet kernel that can incorporate distance information between covariates. However, their method requires creating a
https://doi.org/10.1016/j.patrec.2018.01.011