A Distributed Approach to Detect Outliers in Very Large Data Sets

Fabrizio Angiulli 1, Stefano Basta 2, Stefano Lodi 3, and Claudio Sartori 3

1 DEIS-UNICAL, Via Pietro Bucci, 41C – 87036 Rende (CS), Italy, f.angiulli@deis.unical.it
2 ICAR-CNR, Via Pietro Bucci, 41C – 87036 Rende (CS), Italy, basta@icar.cnr.it
3 DEIS-UNIBO, Via Risorgimento, 2 – 40136 Bologna, Italy, {stefano.lodi,claudio.sartori}@unibo.it

Abstract. We propose a distributed approach to the problem of distance-based outlier detection in very large data sets. The presented algorithm is based on the concept of outlier detection solving set [1], a small subset of the data set that can be provably used for predicting novel outliers. The algorithm exploits parallel computation in order to meet two basic needs: (i) reducing the run time with respect to the centralized version and (ii) dealing with distributed data sets. The former goal is achieved by decomposing the overall computation into cooperating parallel tasks. Besides preserving the correctness of the result, the proposed scheme exhibited excellent performance: experimental results showed that the run time scales well with the number of nodes. The latter goal is accomplished by executing each of these parallel tasks on only a portion of the entire data set, so that the proposed algorithm is suitable for use over distributed data sets. Importantly, while solving the distance-based outlier detection task in the distributed scenario, our method computes an outlier detection solving set of the overall data set of the same quality as that computed by the corresponding centralized method.
P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part I, LNCS 6271, pp. 329–340, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

Detecting outliers in large data sets, that is, finding examples that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data [7], is an important research field with practical applications in several domains, such as fraud detection, network intrusion detection, data cleaning, medical diagnosis, and marketing segmentation. Unsupervised approaches to outlier detection are able to discriminate each datum as normal or exceptional when no training examples are available. Among the unsupervised approaches, distance-based outlier detection methods distinguish an object as outlier on the basis of