Privacy Preserving Outlier Detection using Locality Sensitive Hashing

Nisarg Raval, Madhuchand Rushi Pillutla, Piyush Bansal, Kannan Srinathan, C. V. Jawahar
International Institute of Information Technology Hyderabad, India
{nisarg.raval@, rushi.pillutla@, piyush bansal@}research.iiit.ac.in
{srinathan@, jawahar@}iiit.ac.in

Abstract—In this paper, we give approximate algorithms for privacy preserving distance based outlier detection for both horizontal and vertical distributions, which scale well to large datasets of high dimensionality in comparison with the existing techniques. In order to achieve efficient private algorithms, we introduce an approximate outlier detection scheme for the centralized setting which is based on the idea of Locality Sensitive Hashing. We also give theoretical and empirical bounds on the level of approximation of the proposed algorithms.

Keywords-privacy; outlier detection; LSH

I. INTRODUCTION

Data mining, the process of extracting patterns from large datasets, has become an important research area due to the exponential growth of digital data and storage capabilities. However, when large datasets are collected from various input sources, the data is often distributed across the network, raising concerns about privacy and security while performing distributed data mining. To overcome this problem, Privacy Preserving Data Mining (PPDM) methods have been proposed, which are based on data randomization techniques and cryptographic techniques. PPDM methods based on the latter were introduced by Lindell and Pinkas in [7]. In that paper, an algorithm for privacy preserving ID3 classification was described. Subsequently, privacy preserving algorithms have been proposed for various data mining tasks such as association rule mining, classification, clustering and outlier detection.

Privacy preserving outlier detection (PPOD) was introduced by Vaidya et al. in [11].
They use the definition for distance based outliers provided in [6], and give PPOD algorithms for both horizontal and vertical partitioning of data. Subsequently, a PPOD algorithm using the k-nearest neighbor based definition [9] was given in [12], considering only vertical partitioning. However, all of the above mentioned algorithms have quadratic communication and computation complexities in the database size, making them infeasible for large datasets. To the best of our knowledge, no other work in the field of PPDM based on cryptographic techniques has addressed distance based outlier detection. Privacy preserving density based outlier detection algorithms have been proposed in [2], [10].

In this paper, we propose approximate PPOD algorithms for both horizontal and vertical partitioning of data. As opposed to the current PPOD algorithms, which provide privacy for already existing outlier detection algorithms, we develop a new outlier detection scheme for the centralized setting in order to achieve efficient algorithms in private settings. The centralized scheme is based on our previous work on approximate outlier detection [8], which uses the Locality Sensitive Hashing (LSH) technique [5]. We also give theoretical bounds on the level of approximation and provide the corresponding empirical evidence.

The computational complexity of our centralized algorithm is $O(ndL)$ for a $d$-dimensional dataset with $n$ objects. The parameter $L$ is defined as $n^{1/(1+\epsilon)}$, where $\epsilon > 0$ is an approximation factor. The computational complexity of our PPOD algorithms in both horizontally and vertically distributed settings is the same as that of the centralized algorithm, which is a considerable improvement over the previously known result of $O(n^2 d)$. The communication complexity is $O(nL)$ in the vertically distributed setting and $O(N_b L \log n)$ in the horizontally distributed setting, where $N_b \ll n$ is the average number of bins created during each of the $L$ iterations of LSH.
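As a rough illustration of these bounds, the following Python sketch evaluates the stated orders of growth for given $n$, $d$ and $\epsilon$. It is not part of the proposed algorithms: the function names, the rounding of $L$ up to an integer, the base-2 logarithm, and the caller-supplied value of $N_b$ (`n_bins`) are all assumptions made for illustration, and constant factors hidden by the $O(\cdot)$ notation are ignored.

```python
import math


def lsh_iterations(n: int, eps: float) -> int:
    """Return L = n^(1/(1+eps)), rounded up (the rounding is an assumption)."""
    if n < 1 or eps <= 0:
        raise ValueError("need n >= 1 and eps > 0")
    return math.ceil(n ** (1.0 / (1.0 + eps)))


def cost_summary(n: int, d: int, eps: float, n_bins: int) -> dict:
    """Evaluate the asymptotic expressions from the text, ignoring constants.

    n_bins stands for N_b, the average number of bins per LSH iteration;
    it depends on the data, so here it is a hypothetical input.
    """
    L = lsh_iterations(n, eps)
    log_n = math.ceil(math.log2(n))  # log base is an assumption
    return {
        "L": L,
        "lsh_computation": n * d * L,            # O(n d L)
        "pairwise_computation": n * n * d,       # O(n^2 d)
        "vertical_communication": n * L,         # O(n L)
        "horizontal_communication": n_bins * L * log_n,  # O(N_b L log n)
        "pairwise_communication": n * n,         # quadratic in n
    }


if __name__ == "__main__":
    for name, value in cost_summary(n=10_000, d=50, eps=1.0, n_bins=20).items():
        print(f"{name:>26}: {value:,}")
```

For example, with $n = 10{,}000$ and $\epsilon = 1$, $L$ is about $\sqrt{n} = 100$, so $ndL$ is roughly a factor of 100 smaller than $n^2 d$. A larger $\epsilon$ reduces $L$ further, at the price of a coarser approximation.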
Thus in both cases, we show a significant improvement over the existing communication complexity, which is quadratic in the dataset size. Further, the communication cost of our privacy preserving algorithm in the horizontal distribution is independent of data dimensionality, so the algorithm remains efficient even for datasets of very high dimensionality, unlike the existing algorithms. However, we achieve the above mentioned improvements at the cost of an approximate solution to outlier detection.

II. OVERVIEW AND BACKGROUND

Our outlier detection scheme uses the definition of a distance based outlier proposed by Knorr et al. [6].

Definition 1. $DB(p_t, d_t)$ outlier: An object $o$ in a dataset $D$ is a $DB(p_t, d_t)$ outlier if at least a fraction $p_t$ of the objects in $D$ lie at a distance greater than $d_t$ from $o$.

In our approach, we use the converse of this definition and consider an object to be a non-outlier if it has enough neighbors ($p'_t$) within distance $d_t$, where $p'_t = (1 - p_t) \times |D|$. Since the fraction $p_t$ is very high (usually set to 0.9988), the modified point threshold $p'_t$ will be very small compared to the number of objects in $D$. This allows us to easily detect

2011 11th IEEE International Conference on Data Mining Workshops 978-0-7695-4409-0/11 $26.00 © 2011 IEEE DOI 10.1109/ICDMW.2011.141