International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 03 Issue: 12 | Dec-2016 www.irjet.net p-ISSN: 2395-0072
© 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 892
IMPROVED OUTLIER DETECTION USING CLASSIC KNN ALGORITHM
K.T.Divya1, N.Senthil Kumaran2
1Research Scholar, Department of Computer Science, Vellalar College for Women, Erode, Tamilnadu, India
2Assistant Professor, Dept. of Computer Applications, Vellalar College for Women, Erode, Tamilnadu, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Outlier detection is the identification of items, events or observations that do not conform to an expected pattern or to other items in a dataset. Identifying instances that diverge from expected behavior is an important task. Existing techniques provide a solution to the problem of anomaly detection in categorical data in a semi-supervised setting. That outlier detection approach is based on DILCA (distance learning for categorical attributes), a distance learning framework. The key intuition of DILCA is that the distance between two values of a categorical attribute can be determined by the way in which they co-occur with the values of other attributes in the data set. Existing techniques work well for fixed-schema data with low dimensionality. However, certain applications require privacy-preserving publishing of transactional data (or basket data), which involves hundreds or even thousands of dimensions, rendering existing methods unusable. This work proposes novel anonymization methods for sparse high-dimensional data, based on approximate classic K-Nearest Neighbor (KNN) search in high-dimensional spaces. These representations facilitate the formation of anonymized groups with low information loss through an efficient linear-time heuristic. Among the proposed techniques, classic KNN search yields superior data utility but incurs higher computational overhead. In addition, a dimensionality reduction technique is used. In this work, a healthcare dataset is used.
Key Words: Outlier Detection, Distance Learning, Semi-Supervised Anomaly Detection, Classic KNN Algorithm.
1. INTRODUCTION
Outlier detection (also called anomaly detection) is, in data mining, the identification of items, events or observations that do not conform to an expected pattern or to other items in a dataset. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions. Outliers in input data can skew and mislead the training process of machine learning algorithms, resulting in longer training times, less accurate models and ultimately poorer results. In statistics, an outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement, or it may indicate experimental error; the latter are sometimes excluded from the data set. Outliers, being the most extreme observations, may include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low. However, the sample maximum and minimum are not always outliers, because they may not be unusually far from other observations. Outliers can be classified into three categories: point outliers, contextual outliers and collective outliers. Outliers can have many anomalous causes; for example, a physical apparatus for taking the measurements may have suffered a transient malfunction.
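The distance-based intuition above — an outlier is a point far from its neighbors — is the basis of the classic KNN approach to outlier scoring. The following is a minimal illustrative sketch, not the authors' exact implementation: each point is scored by the distance to its k-th nearest neighbor, and points with the largest scores are flagged as outlier candidates. The function name and the sample data are assumptions for illustration only.

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Score each point by the distance to its k-th nearest neighbor.

    Points far from their neighbors (large scores) are outlier candidates.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    # Distance to the k-th nearest neighbor of each point
    return np.sort(dists, axis=1)[:, k - 1]

# A tight cluster plus one distant point: the last point scores highest
data = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.9], [10.0, 10.0]]
scores = knn_outlier_scores(data, k=3)
print(scores.argmax())  # → 4
```

In practice a threshold on the score (or taking the top-n scores) separates outliers from inliers; the brute-force pairwise computation here is quadratic, which is why high-dimensional or large datasets motivate the approximate KNN search discussed in the abstract.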
2. LITERATURE SURVEY
Dino Ienco, Ruggero G. Pensa, and Rosa Meo [2016] describe the problem of anomaly detection in categorical data in a semi-supervised setting. The proposed approach is based on DILCA (distance learning for categorical attributes), a distance learning framework. The key intuition of DILCA is that the distance between two values of a categorical attribute Ai can be determined by the way in which they co-occur with the values of other attributes in the data set [1]. Relevancy and redundancy are determined by the symmetric uncertainty (SU) measure, which is shown to be a good estimate of the correlation between