Visual Data Mining for Identification of Patterns and Outliers in Weather Stations' Data

José Roberto M. Garcia, Antônio Miguel V. Monteiro, and Rafael D.C. Santos

Brazilian National Institute for Space Research, Av dos Astronautas, 1.758, Jd. Granja, CEP 12227-010, São José dos Campos, São Paulo, Brasil

Abstract. Quality control of climate data obtained from weather stations is essential to ensure the reliability of research and services based on this data. One way to perform this control is to compare data received from one station with data from other stations that are expected to show similar behavior. The purpose of this work is to evaluate some visual data mining techniques for identifying groupings of weather stations (and outliers of these groupings) using historical precipitation data from a specific time interval. We present and discuss the techniques' details, variants, results, and applicability to this type of problem.

Keywords: Visual data mining, clustering, self-organizing map, fuzzy C-means.

1 Introduction

Observational data obtained from weather stations is important because it is used to generate numerical weather and climate predictions, evaluate model results, and conduct climate research [1], so reliable data is essential for reliable research and applications. However, the data is not completely reliable: some weather stations are still operated by humans and are therefore often subject to reporting errors, and even the automatic ones depend on hardware and network communication, which can corrupt the data [2]. A quality control system is clearly required to verify the data's quality.

At the Brazilian National Institute for Space Research's (INPE) CPTEC (Brazilian National Center for Weather Prediction and Climate Studies) there is a 3-level quality control system for weather station data.
The first level verifies whether each value lies within upper and lower limits defined for the variable; the second uses arbitrary rectangular geographic regions, with limits on the variables within each region; and the third uses limits specific to each variable and weather station. These controls aim to reject spurious data and classify suspicious data [3]. The problem with these approaches is that the bounds used (for the variables and the geographic regions) are not natural and can filter important data out of the datasets used for analysis. Moreover, the number of rejections increases the workload of the data administrator, who needs to analyze all the rejected data one by one.

H. Yin et al. (Eds.): IDEAL 2012, LNCS 7435, pp. 245–252, 2012. © Springer-Verlag Berlin Heidelberg 2012
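
To make the three-level limit checking described in the Introduction concrete, the following is a minimal Python sketch, not CPTEC's actual implementation. All names (`Observation`, `Region`, `quality_check`), the station identifier, and the numeric limits are hypothetical and chosen only for illustration. The text does not say which level rejects data and which only flags it, so the sketch assumes that a value outside the global limits is rejected and a value outside the regional or per-station limits is marked as suspicious.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Limits = Tuple[float, float]  # (lower, upper), both inclusive


@dataclass(frozen=True)
class Observation:
    station_id: str
    lat: float
    lon: float
    variable: str
    value: float


@dataclass(frozen=True)
class Region:
    """Rectangular geographic region with its own limits per variable."""
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float
    limits: Dict[str, Limits]

    def contains(self, obs: Observation) -> bool:
        return (self.lat_min <= obs.lat <= self.lat_max
                and self.lon_min <= obs.lon <= self.lon_max)


def within(value: float, limits: Limits) -> bool:
    lower, upper = limits
    return lower <= value <= upper


def quality_check(obs: Observation,
                  global_limits: Dict[str, Limits],
                  regions: List[Region],
                  station_limits: Dict[Tuple[str, str], Limits]) -> str:
    # Level 1: limits for the variable, regardless of location.
    # ASSUMPTION: a violation here means the value is rejected.
    limits = global_limits.get(obs.variable)
    if limits is not None and not within(obs.value, limits):
        return "rejected"

    # Level 2: limits for the variable inside each rectangular region.
    # ASSUMPTION: a violation here only marks the value as suspicious.
    for region in regions:
        region_limits = region.limits.get(obs.variable)
        if (region_limits is not None and region.contains(obs)
                and not within(obs.value, region_limits)):
            return "suspicious"

    # Level 3: limits for this variable at this specific station.
    limits = station_limits.get((obs.station_id, obs.variable))
    if limits is not None and not within(obs.value, limits):
        return "suspicious"

    return "accepted"


# Hypothetical limits for daily precipitation in millimetres.
GLOBAL_LIMITS = {"precip_mm": (0.0, 500.0)}
REGIONS = [Region(-25.0, -20.0, -50.0, -44.0, {"precip_mm": (0.0, 200.0)})]
STATION_LIMITS = {("S1", "precip_mm"): (0.0, 120.0)}


def check(value: float) -> str:
    obs = Observation("S1", -23.2, -45.9, "precip_mm", value)
    return quality_check(obs, GLOBAL_LIMITS, REGIONS, STATION_LIMITS)
```

With these illustrative limits, `check(600.0)` fails the first level, `check(250.0)` fails the second, `check(150.0)` fails only the third, and `check(50.0)` passes all three. The sketch also shows the drawback noted in the text: every bound is fixed in advance, so an unusual but genuine reading is handled exactly like a faulty one and still has to be reviewed by hand.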