Optimal Bayesian Classification in Nonstationary Streaming
Environments
Jehandad Khan, Nidhal Bouaynaya, Robi Polikar
Abstract—A novel method of classifying data drawn from a
nonstationary distribution with drifting mean and variance is
presented. The novelty of the approach lies in splitting the
problem of tracking a nonstationary distribution into separate
classification and time series state estimation problems. State
space models for drift in both the mean and variance are
presented, which are then successfully tracked using a Kalman
filter and a particle filter for the linear and nonlinear parts,
respectively. Preliminary results, which show the promise of the
approach, are also presented, along with concluding remarks on
potential uses of the proposed approach.
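The state-estimation component described above can be illustrated with a toy sketch: a scalar Kalman filter tracking a slowly drifting class mean under a random-walk state model. This is an illustrative example only, not the authors' implementation; the noise variances `q` and `r` and the data-generating parameters are assumed values chosen for the demonstration.

```python
import numpy as np

def kalman_track_mean(z, q=0.01, r=1.0):
    """Track a drifting mean with a scalar Kalman filter.

    State model (random walk):  mu_k = mu_{k-1} + w_k,  w_k ~ N(0, q)
    Observation model:          z_k  = mu_k + v_k,      v_k ~ N(0, r)
    """
    mu, p = z[0], 1.0          # initial state estimate and its variance
    estimates = [mu]
    for zk in z[1:]:
        # Predict: under the random-walk model, only the variance grows.
        p = p + q
        # Update: blend the prediction with the new observation.
        k = p / (p + r)        # Kalman gain
        mu = mu + k * (zk - mu)
        p = (1.0 - k) * p
        estimates.append(mu)
    return np.array(estimates)

rng = np.random.default_rng(0)
true_mean = np.cumsum(rng.normal(0.0, 0.1, 200))   # slowly drifting mean
obs = true_mean + rng.normal(0.0, 1.0, 200)        # noisy observations
est = kalman_track_mean(obs, q=0.01, r=1.0)
print("MAE raw:", np.mean(np.abs(obs - true_mean)))
print("MAE filtered:", np.mean(np.abs(est - true_mean)))
```

On this synthetic drift, the filtered estimate tracks the true mean with a substantially lower error than the raw observations, which is the behavior the proposed approach relies on for the linear part of the model.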
I. INTRODUCTION
Most classification algorithms rely on the underlying as-
sumption that the distribution generating the data is stationary.
However, this is a restrictive assumption, since many real-world
problems generate data whose underlying distributions
change over time. Real world applications that generate such
nonstationary data include climate change, remote-sensing
applications, metagenomic applications (genomic analysis of
environmental samples, where species abundance change dra-
matically along unknown environmental gradients), analysis of
web-user interest, identification of financial fraud from trans-
action data, prediction of energy demand and pricing, among
many others. Also relevant are installations with limited access
(e.g., oil pipelines, building foundations, extreme geographic
locations, etc.), where data later collected from embedded
sensors can be subject to a variety of nonstationary changes,
e.g., cracks from freeze-thaw cycles or shifting tectonic
plates. The stationarity assumption is often used to
simplify the mathematical setting of the problem, and thus
also simplify the derived solutions. However, this simplifying
assumption forces the problem into a subspace of the original
problem, often resulting in suboptimal solutions. Taking the
nonstationary nature of the problem into consideration would
allow us to take advantage of the full richness of the data,
resulting in more accurate classification and prediction in
tracking nonstationary environments.
The nonstationarity, also known as concept drift, can be
treated using a variety of approaches such as domain adaptation
[1], [2], covariate shift [3], or, more generally, as sample
selection bias [4], or with specific ensemble-based approaches
such as Learn++.NSE [5], DWM [6], and SEA [7].

J. Khan, N. Bouaynaya, and R. Polikar are with the Dept. of
Electrical & Computer Engineering at Rowan University (email:
khanj6@students.rowan.edu, {bouaynaya,polikar}@rowan.edu).
This material is based upon work supported by National Science
Foundation grants ECCS-1310496, CRI CNS-0855248, EPS-0701890,
EPS-0918970, MRI CNS-0619069, and OISE-0729792. This project is
also supported by Award Number R01GM096191 from the National
Institute of General Medical Sciences (NIH/NIGMS).

These techniques acknowledge that the probability distribution
that generated the data at any point in time is different from
the probability distribution on which the classifier will make
its prediction, i.e., p_s(x, y) ≠ p_t(x, y), where p_s and p_t
are the
source and target distributions, respectively, for the features x
and labels y. These approaches rely on different assumptions
about the source and target distributions: for example, in
covariate shift, it is assumed that the support of p_s(x, y)
contains the support of p_t(x, y) [8]; thus, the source and
target distributions may differ but are still related. Moreover,
it is also assumed that a sufficient amount of labeled and
unlabeled data is available in the source and target domains,
respectively.
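The covariate-shift assumption can be made concrete with a small sketch of importance weighting, where source-domain samples are reweighted by the density ratio p_t(x)/p_s(x) to estimate a target-domain quantity. The Gaussian densities below are hypothetical stand-ins chosen so the ratio is known in closed form; in practice the ratio itself must be estimated from the labeled source data and the unlabeled target data.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(1)

# Source samples drawn from p_s(x) = N(0, 1). The target density
# p_t(x) = N(0.5, 1) has support contained in that of p_s.
x_src = rng.normal(0.0, 1.0, 1000)

# Importance weights w(x) = p_t(x) / p_s(x) on the source samples.
w = gaussian_pdf(x_src, 0.5, 1.0) / gaussian_pdf(x_src, 0.0, 1.0)

# Self-normalized importance-weighted estimate of the target-domain
# mean E_t[x], computed using only source-domain samples.
est_target_mean = np.sum(w * x_src) / np.sum(w)
print("estimated target mean:", est_target_mean)   # should be near 0.5
```

The unweighted source average would be near 0, while the weighted estimate recovers the target mean near 0.5, which is why the support-containment assumption matters: wherever p_t is positive, p_s must also be positive for the ratio to be defined.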
The aforementioned algorithms also require a large amount
of labeled data (at least from the source domain), making the
limited availability and high cost of labeled data a potential
obstacle to using these approaches. In medical
diagnostics, for example, it is highly desirable that the learning
algorithm is trained using a minimum number of subjects,
typically due to the scarcity of consenting subjects, the mon-
etary cost associated with running diagnostic tests, or even
the rarity of the disease. Semi-Supervised Learning (SSL) has
been used for such scenarios of limited labeled training data,
wherein class information is propagated from a small number of
labeled instances to the more abundant unlabeled instances [9],
using approaches such as density separation, decision-boundary
detection, or graph construction. The
primary focus of SSL techniques has been on stationary data
environments, but there have been some recent advances that
deal with data generated from nonstationary distributions.
These methods still have the canonical SSL implementation at
their core, with an exterior modification that accounts for the
drifting probability distributions. However, most such
approaches still require that labeled data be available at each time
point [10]. Active Learning (AL) is another approach used to
tackle the limited data availability by selecting instances from
the data that provide maximum information about the class
boundaries, and then requesting the corresponding labels. AL
algorithms rely on the immediate availability of labels for
any requested instance, an unrealistic expectation in certain
applications.
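The instance-selection step at the heart of AL can be sketched with uncertainty sampling, one common query strategy (not necessarily the one used by any particular cited work): among the unlabeled instances, request labels for those whose predicted class probability lies closest to the decision boundary. The classifier confidences below are hypothetical values for illustration.

```python
import numpy as np

def uncertainty_sample(probs, k=1):
    """Return indices of the k unlabeled instances whose predicted
    positive-class probability is closest to 0.5, i.e., the instances
    about which the classifier is least certain."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(np.abs(probs - 0.5))  # most uncertain first
    return order[:k]

# Hypothetical classifier confidences for five unlabeled instances.
probs = [0.95, 0.52, 0.10, 0.41, 0.80]
print(uncertainty_sample(probs, k=2))   # → [1 3]
```

Instances 1 and 3 (probabilities 0.52 and 0.41) are nearest the boundary, so their labels are requested first; confident predictions such as 0.95 and 0.10 are deferred, which is how AL concentrates the labeling budget on the class boundary.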
Our work focuses on the nonstationarity of the source and
target distributions generating the data, as well as the scarcity
of labeled data instances. As stated above, these two problems
are typically dealt with separately; but it is not unusual that
both scenarios manifest themselves at the same time, hence
2014 International Joint Conference on Neural Networks (IJCNN)
July 6-11, 2014, Beijing, China