Spam Outlier Detection in High Dimensional Data: Ensemble Subspace Clustering Approach

Suresh S. Kapare, Bharat A. Tidke
Computer Department, Savitribai Phule Pune University
Flora Institute of Technology, Pune, Maharashtra, India

Abstract: High-dimensional data is increasingly in demand in areas such as social networking sites, biomedical data, and sports. Many data sets are represented with hundreds or thousands of dimensions. As the number of dimensions grows, the “curse of dimensionality” makes traditional outlier detection methods inefficient. Increasing the dimensions of data objects makes it difficult to find the points that do not fit in any group (cluster), which are called outliers. Outlier detection has important applications in fraud detection, network robustness analysis, error elimination in scientific data, sports data analysis, and intrusion detection. Most such applications are high-dimensional domains in which the data can contain hundreds of dimensions. Spam can be link-based or content-based. This paper proposes ensemble subspace clustering as a paradigm for spam outlier detection in high-dimensional data sets. The proposed method divides the original high-dimensional data set into subspace clusters using a subspace clustering algorithm. An improved k-means algorithm then finds outlier clusters, which are further merged with other clusters depending on a consensus function. An outlier cluster that does not merge with any other subspace cluster is called a final outlier.

Keywords: Outlier, high dimensional data, subspace, ensemble, clustering.

I. INTRODUCTION

As automation spreads through everyday life, the use of computers has become mandatory. We try to handle all types of processing with the help of computers. Many fields of work, such as engineering, science, biomedicine, agriculture, finance, sports, education, and telecommunication, are targeted for automation, and automation involves computers.
Such processes generate databases, and advances in these fields give the databases many dimensions, counted in hundreds or thousands. Such a database is called high-dimensional data. To analyse high-dimensional data, we need to group the data objects; the process of forming groups is called clustering. As the number of dimensions increases, the data becomes sparse, which is known as the “curse of dimensionality”. That is, if we try to form clusters by applying a set of dimensions to the data points, the points that densely satisfy those dimensions crowd together in a region, while points that satisfy fewer of the cluster's dimensions spread out across the region. Such data points are called outliers with respect to that cluster. However, an outlier of one cluster may belong densely to another cluster or to another set of dimensions. Finding outliers with respect to an individual cluster is therefore not sufficient; each candidate must also be checked against the other clusters. For this reason, many outlier detection methods do not work efficiently on high-dimensional data sets.

II. LITERATURE SURVEY

Technology is advancing in fields such as medicine, animation, sports, neural analysis, and reviews. Data collected from such fields is not simple: it may be described by hundreds or thousands of dimensions. Retrieving information from such high-dimensional data is crucial work, and finding outliers in it is a challenge.

A. Outlier Detection in High Dimensional Data

Jonathan Von Brunken [1] proposed a method showing how a measure of intrinsic dimensionality can be used to improve outlier detection in data with strongly varying intrinsic dimensionality. The intrinsic dimensionality of every data object is calculated, which gives the number of variables required to describe the data.
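As a rough illustration of the per-object intrinsic dimensionality described above, the following sketch estimates a local ID value for every point from its nearest-neighbour distances. It assumes a Hill-type maximum-likelihood estimator, which is one common choice and not necessarily the estimator used in [1]. The function name local_id_mle and the neighbourhood size k are hypothetical, chosen only for this example.

```python
import numpy as np

def local_id_mle(X, k=10):
    """Hill-type maximum-likelihood estimate of local intrinsic
    dimensionality for every row of X (an assumed estimator, see lead-in)."""
    # Pairwise Euclidean distances; acceptable for a small illustrative set.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Sort each row; column 0 is the point itself (distance 0), so skip it.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    r_k = knn[:, -1:]  # distance to the k-th nearest neighbour
    # ID = -1 / mean(log(r_i / r_k)); the i = k term contributes log(1) = 0.
    mean_log = np.log(knn / r_k).mean(axis=1)
    return -1.0 / mean_log

# Synthetic data: a 2-D plane embedded in 10-D versus a full 10-D cloud.
rng = np.random.default_rng(0)
plane = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 10))
full = rng.normal(size=(300, 10))
print("median local ID, 2-D plane in 10-D:", np.median(local_id_mle(plane)))
print("median local ID, full 10-D cloud:  ", np.median(local_id_mle(full)))
```

On synthetic data of this kind the estimate for the embedded plane should come out well below the estimate for the full 10-dimensional cloud, even though both data sets are stored with 10 coordinates. The sketch assumes no duplicate points, since a zero neighbour distance would make the logarithm undefined.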
This local intrinsic dimensionality (ID) yields the Intrinsic Dimensionality Outlier Score (IDOS), which is used to distinguish between inliers and outliers. The ID of an inlier shows stronger agreement with the remaining inlier members of its subspace and weaker agreement with non-members of the cluster. This helps to separate outliers from the subspace.

C. C. Aggarwal [2] proposed a method that works by finding lower-dimensional projections that are locally sparse and cannot be discovered easily by brute-force techniques because of the number of possible combinations. The proposed algorithm finds outliers by studying the behaviour of projections of the data set.

B. Ensemble Subspace Clustering

Reza Ghaemi and Md. Nasir Sulaiman [4] presented a survey of clustering ensemble techniques. They explain that many real-world application domains, such as data compression, data mining, and pattern recognition, need data clustering analysis. All of this data is multidimensional, so a single clustering algorithm cannot achieve the required clustering analysis. A clustering ensemble technique therefore integrates multiple clustering solutions into a single consensus solution. The principle of an ensemble is to generate a set of different models and then aggregate them into one. The main consensus functions commonly used in clustering ensembles are identified. Random subspaces are an excellent source of clustering diversity that provides different views of the data. Projective clustering is an active topic in data mining. Here, however, we are only concerned with the use

Suresh S. Kapare et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (3), 2015, 2326-2329
www.ijcsit.com 2326