International Journal of Science and Research (IJSR) ISSN (Online): 2319-7064 Impact Factor (2012): 3.358 Volume 3 Issue 7, July 2014 www.ijsr.net Licensed Under Creative Commons Attribution CC BY

Hybrid Approach for Outlier Detection in High Dimensional Dataset

Rohini Balkrishna Gurav 1, Sonali Rangdale 2
1 Student, ME [IT], SCOE, Sudumbare, Pune, India
2 Guide and ME [IT] coordinator, SCOE, Sudumbare, Pune, India

Abstract: An object that does not obey the behavior of normal data objects is called an outlier. In many data analysis processes, large amounts of data are recorded or sampled as a data set. It is very important in data mining to find rare events, anomalies, exceptions, etc. Outlier detection has important applications in many fields in which the data can be high-dimensional. As a result, discovering outliers becomes inefficient and even infeasible in high-dimensional space. I devised an outlier detection structure that is based on clustering. Clustering is an unsupervised type of data mining and does not require trained or labeled data. The structure combines density-based and partition-based clustering methods to gain the advantages of both density-based and distance-based outlier detection. Weights are allocated to attributes depending on their individual significance in the mining task, and the weights are adaptive in nature. Weighted attributes are useful for reducing or removing the effect of noisy attributes. In view of the challenges of streaming data, the schemes are incremental and adaptive to concept evolution. In high-dimensional data the number of attributes associated with the dataset is very large, which makes the dataset unmanageable. Thus a feature extraction technique is used to reduce the number of attributes to a manageable value.

Keywords: Attribute weighting, Dataset, DBSCAN, k-means, unsupervised method

1.
Introduction

An object in a data set that does not conform to well-defined notions of expected behavior is called an outlier. Outlier detection is a preprocessing step for data analysis: it is the process of finding objects in the data set that do not conform to particular notions of expected behavior. The detected instances, which do not behave like the other instances in the data set, are called outliers; they are also called anomalies or surprises. Outlier detection is essential for many practical applications, such as e-commerce, intrusion detection, and research. Existing methods are classified into three categories: supervised, semi-supervised, and unsupervised.

The aim of this work is to detect outliers in high-dimensional data using different clustering techniques. This outlier detection method can be used to find anomalies in the behavior of certain objects in the dataset, which is important in fields such as medicine, industry, and network intrusion detection. Outlier detection in streaming data is very challenging because streaming data cannot be scanned multiple times and new concepts may keep evolving in the incoming data over time. Irrelevant attributes can be termed noisy attributes, and such attributes further increase the challenge of working with data streams. The volume of data in fields such as medicine and internet transactions is enormous. The outlier detection strategy used for streaming data can be extended to various kinds of high-dimensional data, and an adaptive and dynamic approach can be used for outlier detection in high-dimensional space. Detecting outliers in high-dimensional data is a fast-growing area of data mining, and the increasing use of high-dimensional data increases the need for finding outliers.

2. Related Work

Outlier detection is very important to the data mining research community. Ramaswamy et al. proposed a distance-based outlier detection method.
According to this method, given parameters k and n, an object o is an outlier if no more than n-1 other objects in the dataset have a higher value of D^k than o, where D^k(o) denotes the distance from object o to its kth nearest neighbor. This idea was further developed in later work, where each data point is ranked by the sum of the distances to its k nearest neighbors. Breunig et al. introduced the notion of the local outlier factor (LOF), which captures the relative degree of outlierness of an object. The methods described above are either distance-based or nearest-neighbor-based and are not suitable for outlier detection in data streams because of their high time complexity. He et al. presented a new definition of outlier, which they named the cluster-based local outlier, that gives importance to local data behavior. Duan et al. proposed a cluster-based outlier detection algorithm that can detect both single-point outliers and cluster-based outliers. However, all the techniques described above, and many more, are designed for stored static data sets and are not appropriate for data stream environments.

3. Proposed Work

A. Problem statement

Outlier detection in streaming data is very challenging because streaming data cannot be scanned multiple times and new concepts may keep evolving in the incoming data over time. Irrelevant attributes can be termed noisy attributes, and such attributes further magnify the challenge of working with data streams.

Paper ID: 20071403
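The distance-based definition attributed to Ramaswamy et al. in Section 2 can be illustrated with a minimal Python sketch: each object is scored by the distance to its kth nearest neighbor, and the n objects with the largest scores are reported as outliers. The function names `kth_nn_distance` and `top_n_outliers`, the Euclidean metric, and the toy data are assumptions made for illustration only; this brute-force version is not the method proposed in this paper, and its quadratic cost is the kind of overhead that makes such methods unsuitable for data streams.

```python
import math


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour (Euclidean, assumed)."""
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return dists[k - 1]


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th nearest neighbour distance.

    Ties are broken by original order (an arbitrary choice in this sketch).
    """
    scores = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    scores.sort(key=lambda s: s[0], reverse=True)
    return [i for _, i in scores[:n]]


# Hypothetical toy data: four points close together and one far away.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = top_n_outliers(data, k=2, n=1)
print(outliers)  # the isolated point at index 4 is ranked first
```

Because every score requires a pass over the whole data set, the sketch needs all points to be stored and scanned repeatedly, which is exactly what a streaming setting does not allow.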