International Research Journal of Engineering and Technology (IRJET) | e-ISSN: 2395-0056 | p-ISSN: 2395-0072 | Volume: 03, Issue: 12 | Dec 2016 | www.irjet.net
© 2016, IRJET | ISO 9001:2008 Certified Journal

IMPROVED OUTLIER DETECTION USING CLASSIC KNN ALGORITHM

K.T. Divya 1, N. Senthil Kumaran 2
1 Research Scholar, Department of Computer Science, Vellalar College for Women, Erode, Tamilnadu, India
2 Assistant Professor, Dept. of Computer Applications, Vellalar College for Women, Erode, Tamilnadu, India

Abstract - Outlier detection identifies items, events, or observations that do not conform to an expected pattern or to other items in a dataset. Identifying instances that diverge from expected behavior is an important task. Existing techniques provide a solution to the problem of anomaly detection in categorical data in a semi-supervised setting. The outlier detection approach is based on DILCA (DIstance Learning for Categorical Attributes), a distance learning framework. The key intuition of DILCA is that the distance between two values of a categorical attribute can be determined by the way in which they co-occur with the values of other attributes in the data set. Existing techniques work well for fixed-schema data with low dimensionality; however, certain applications require privacy-preserving publishing of transactional data (or basket data), which involves hundreds or even thousands of dimensions, rendering existing methods unusable. This work proposes novel anonymization methods for sparse high-dimensional data, based on approximate classic k-nearest neighbor search in high-dimensional spaces. These representations facilitate the formation of anonymized groups with low information loss through an efficient linear-time heuristic.
Among the proposed techniques, classic KNN search yields superior data utility but incurs higher computational overhead. In addition, a dimensionality reduction technique is used. In this work, a healthcare dataset is used.

Key Words: Outlier Detection, Distance Learning, Semi-Supervised Anomaly Detection, Classic KNN Algorithm.

1. INTRODUCTION

Outlier detection (also anomaly detection) is the identification, in data mining, of items, events, or observations that do not conform to an expected pattern or to other items in a dataset. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions. Outliers in input data can skew and mislead the training process of machine learning algorithms, resulting in longer training times, less accurate models, and ultimately poorer results. In statistics, an outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement, or it may indicate experimental error; the latter are sometimes excluded from the data set. Outliers, being the most extreme observations, may also include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low. However, the sample maximum and minimum are not always outliers, because they may not be unusually far from other observations. Outliers can be classified into three categories: point outliers, contextual outliers, and collective outliers. Outliers can have many anomalous causes; for example, a physical apparatus for taking the measurements may have suffered a transient malfunction.

2. LITERATURE SURVEY

Dino Ienco, Ruggero G. Pensa, and Rosa Meo [2016] describe the problem of anomaly detection in categorical data in a semi-supervised setting. The proposed approach is based on DILCA (DIstance Learning for Categorical Attributes), a distance learning framework.
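To make the distance-based idea behind classic KNN outlier detection concrete, the following minimal sketch scores each point by the distance to its k-th nearest neighbor, so that points far from any dense region receive high scores. This is an illustrative example only, not the implementation evaluated in this paper; the function name, the toy data set, and the choice of k are assumptions made for the sketch.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor.

    A large score means the point lies far from its neighbors and is
    therefore a candidate outlier (classic distance-based KNN scoring).
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, sorted ascending.
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])  # distance to the k-th nearest neighbor
    return scores

# Tiny 2-D example (hypothetical data): the last point sits far
# from the dense cluster around (1, 1).
data = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 0.8), (10.0, 10.0)]
scores = knn_outlier_scores(data, k=2)
outlier_index = max(range(len(data)), key=lambda i: scores[i])
```

Here the isolated point (10.0, 10.0) receives by far the largest score, since even its 2nd-nearest neighbor lies in the distant cluster; ranking points by this score and flagging the top ones is the basic distance-based detection scheme the paper builds on.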
The key intuition of DILCA is that the distance between two values of a categorical attribute Ai can be determined by the way in which they co-occur with the values of other attributes in the data set [1]. Relevancy and redundancy are determined by the symmetric uncertainty (SU) measure, which is shown to be a good estimate of the correlation between