Resource-bounded Outlier Detection using Clustering Methods 1

Luis TORGO a,2 and Carlos SOARES a

a LIAAD/INESC Porto LA - FEP, University of Porto, Portugal

Abstract. This paper describes a methodology for applying hierarchical clustering methods to the task of outlier detection. The methodology is tested on the problem of cleaning Official Statistics data. The goal is to detect erroneous foreign trade transactions in data collected by the Portuguese Institute of Statistics (INE). These transactions are a minority, but they still have an important impact on the statistics produced by the institute. The detection of these rare errors is a manual, time-consuming task. Tasks of this type are usually constrained by a limited amount of available resources. Our proposal addresses this issue by producing a ranking of outlyingness that allows better management of the available resources, allocating them to the cases that are most different from the others and thus have a higher probability of being errors. Our method is based on the output of standard agglomerative hierarchical clustering algorithms, resulting in no significant additional computational cost. Our results show that it enables large savings by selecting a small subset of suspicious transactions for manual inspection which nevertheless includes most of the erroneous transactions. In this study we compare our proposal to a state-of-the-art outlier ranking method (LOF) and show that our method achieves better results on this particular application. The results of our experiments are also competitive with previous results on the same data. Finally, the outcome of our experiments raises important questions about the method currently followed at INE for items with a small number of transactions.

Keywords.
Outlier detection, outlier ranking, hierarchical clustering, data cleaning

2 Corresponding Author: Luis Torgo, LIAAD, Rua de Ceuta, 118, 6., 4050-190 Porto, Portugal; E-mail: ltorgo@inescporto.pt.

Introduction

This paper addresses the problem of detecting errors in foreign trade data (INTRASTAT) collected by the Portuguese Institute of Statistics (INE). The objective is to identify the transactions that are most likely to contain errors. The selected transactions are then manually analyzed by specialized staff and corrected if an error really exists. The effort required for manual analysis ranges from simply checking the submitted form to making further contact with the company that made the transaction in order to confirm whether the declared values are correct. In any case, the process requires the involvement of expensive human resources and imposes significant costs on INE.
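
As a rough illustration of the resource-bounded idea described in the abstract, the sketch below ranks cases by an outlyingness score taken from an agglomerative clustering run and then keeps only as many cases as the inspection budget allows. It is a minimal sketch under stated assumptions, not the scoring rule proposed in this paper: the score used here (the linkage distance at which a case first stops being a singleton under single linkage) is an assumed stand-in, and the names `first_merge_distance`, `select_for_inspection`, `unit_prices` and `budget` are hypothetical.

```python
def first_merge_distance(values):
    """Naive single-linkage agglomerative clustering on 1-D values.

    Returns, for each case, the linkage distance at which it first
    merged with another cluster. Larger values mean the case stayed
    isolated for longer (assumed here as an outlyingness score).
    """
    clusters = [[i] for i in range(len(values))]
    score = [0.0] * len(values)
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(values[i] - values[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # A singleton taking part in this merge receives its score now.
        for members in (clusters[a], clusters[b]):
            if len(members) == 1:
                score[members[0]] = d
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return score


def select_for_inspection(scores, budget):
    """Return the indices of the `budget` highest-scoring cases."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:budget]


# Hypothetical unit prices for one item; the values are made up for illustration.
unit_prices = [10.0, 10.5, 11.0, 10.2, 55.0, 9.8]
budget = 2  # number of transactions the staff can inspect manually
scores = first_merge_distance(unit_prices)
print(select_for_inspection(scores, budget))
```

The point of returning a ranking rather than a yes/no label is that the cut-off can follow the available resources: the same scores serve a budget of 2 or of 200 inspections.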