Pattern Recognition 89 (2019) 161–171
Integration of deep feature extraction and ensemble learning for
outlier detection
Debasrita Chakraborty a, Vaasudev Narayanan b, Ashish Ghosh a,∗
a Machine Intelligence Unit, Indian Statistical Institute, 203 B. T. Road, Kolkata, 700108, India
b Department of Computer Science and Engineering, Indian Institute of Technology, Dhanbad, 826004, India
Article info
Article history:
Received 27 June 2018
Revised 7 December 2018
Accepted 2 January 2019
Available online 3 January 2019
Keywords:
Deep learning
Autoencoders
Probabilistic neural networks
Ensemble learning
Outlier detection
Abstract
Most datasets do not contain an equal number of samples for each class. In some tasks, however, such as the detection of fraudulent transactions, the class imbalance is overwhelming and one of the classes contains very few samples (even less than 10% of the entire data). These tasks often fall under outlier detection. Moreover, in some scenarios there may be multiple subsets of the outlier class; such cases should be treated as multiple outlier type detection. In this article, we propose a system that can efficiently handle all the aforementioned problems. We use stacked autoencoders to extract features and then an ensemble of probabilistic neural networks that detects outliers by majority voting. On most of the datasets tested, the proposed system performs better and more reliably than other outlier detection systems, and the use of autoencoders clearly enhances outlier detection performance.
© 2019 Elsevier Ltd. All rights reserved.
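As a rough sketch of the detection stage summarised above, the following NumPy code implements an ensemble of probabilistic neural networks (Parzen-window classifiers in the sense of Specht) combined by majority voting. The autoencoder feature-extraction step is omitted, and the bootstrap resampling, the kernel width `sigma`, and all function names are illustrative assumptions on our part, not the authors' implementation.

```python
import numpy as np

def pnn_predict(X_train, y_train, X_test, sigma=0.5):
    """Parzen-window probabilistic neural network: each test point is
    assigned the class with the highest average Gaussian-kernel
    activation over that class's training samples."""
    classes = np.unique(y_train)
    scores = np.zeros((len(X_test), len(classes)))
    for j, c in enumerate(classes):
        Xc = X_train[y_train == c]            # pattern units of class c
        # squared Euclidean distances, shape (n_test, n_c)
        d2 = ((X_test[:, None, :] - Xc[None, :, :]) ** 2).sum(-1)
        scores[:, j] = np.exp(-d2 / (2.0 * sigma ** 2)).mean(axis=1)
    return classes[scores.argmax(axis=1)]

def ensemble_vote(X_train, y_train, X_test, n_members=5, seed=0):
    """Majority vote over PNNs trained on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X_train), len(X_train))  # bootstrap
        votes.append(pnn_predict(X_train[idx], y_train[idx], X_test))
    votes = np.stack(votes)                   # (n_members, n_test)
    # per-test-point majority over the member predictions
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

In the full system, `X_train` and `X_test` would be the encoded representations produced by the stacked autoencoders rather than the raw features.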
1. Introduction
Outliers play an important role in defining the nature of a dataset. They are interesting points that do not conform to the expected or natural behavior of the data, and are variously referred to as anomalies, exceptions, discordant observations, surprises, peculiarities, aberrations, or contaminants in different application domains. An outlier is an observation that is highly unlikely under a model built to generate the data [1]. In most practical cases the model is abstract, as in finding a fraudulent credit card transaction among millions of genuine transactions. The data may also contain multiple types of outliers, such as different types of intrusions in a network. One might consider the intrusions as a single outlier class and approach the problem in a binary or a multi-class fashion. However, the practicality of such an approach is questionable, as intrusions are highly diverse and may appear in the data for different reasons. Many such cases make single type and multiple type outlier detection a crucial part of the data analysis process. Figs. 1 and 2 show the two approaches: in the former, the diversity of the outlier types is not considered, and in the latter it is.
∗ Corresponding author. E-mail address: ash@isical.ac.in (A. Ghosh).
There is, however, a bottleneck at which most algorithms get stuck: outliers are extremely rare in any dataset. In most cases, the number of samples from the outlier class is below 10% of the total number of samples in the training set, which makes identifying them one of the difficult problems in data analysis [2]. Sampling techniques are usually not preferred for outlier detection, since oversampling a minority (outlier) class or undersampling a majority (inlier) class usually affects the generalisation capability [3]. Moreover, such extreme imbalance (below 10%) is a major challenge for many algorithms, because the presence of an outlier often misleads clustering, classification, or regression. There may also be cases where the outliers arise for different reasons and are diverse among themselves. When the dataset contains multiple types of outliers, each with a different property, treating all the outlier classes as a single class may not make sense. This differs from a multi-class imbalance problem in that the outliers here constitute less than 10% of the entire dataset.
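The distinction drawn above between an ordinary minority class and an outlier class rests on the 10% threshold mentioned in the text. A minimal sketch of that criterion follows; the function name and the configurable threshold are our own illustrative assumptions, not part of the paper's method.

```python
from collections import Counter

def imbalance_report(y, outlier_threshold=0.10):
    """Compute the share of each class label in y and flag classes
    rare enough (below the threshold, here the text's 10%) to be
    treated as outlier classes rather than ordinary minority classes."""
    counts = Counter(y)
    n = len(y)
    shares = {c: counts[c] / n for c in counts}
    outlier_classes = sorted(c for c, s in shares.items()
                             if s < outlier_threshold)
    return shares, outlier_classes
```

With several flagged classes, the problem is a multiple outlier type detection task; with exactly one, it reduces to the single type case.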
This article proposes and investigates a new supervised outlier detection framework inspired by the projection methodology through deep learning. It is shown analytically that the proposed method alleviates the aforesaid drawbacks of the existing standard approaches (we have not considered unsupervised or semi-supervised outlier detection methodologies, in order to give a fair comparison with the proposed method). We have conducted multiple experiments on several datasets to examine whether the non-linear
https://doi.org/10.1016/j.patcog.2019.01.002