Active Sampling for Data Mining

Emanuele Olivetti and Paolo Avesani
ITC-iRST
Via Sommarive 14 - I-38050 Povo (TN) - Italy
{olivetti,avesani}@itc.it

Abstract. Data mining is a complex process that aims to derive an accurate predictive model from a collection of data. Traditional approaches assume that the data are given in advance and that their quality, size and structure are independent parameters. In this paper we argue that an extended vision of data mining should include data acquisition as part of the overall process. Moreover, the static view should be replaced by an evolving perspective that conceives data mining as an iterative process in which data acquisition and data analysis repeatedly follow each other. A decision support tool based on data mining will have to be extended accordingly: decision making will be concerned not only with prediction but also with a policy for the next data acquisition step. A successful data acquisition strategy will have to take into account both the future model accuracy and the cost associated with the acquisition of each feature. Finding a trade-off between these two components is an open issue. We propose a framework that brings this new and challenging problem into focus.

1 Introduction

Very often initiatives are launched to provide inductive evidence as an explanation of a complex phenomenon even though no collection of data is available in advance. In this context a data acquisition plan clearly becomes a strategic preliminary or intermediate goal.

Arranging a data acquisition plan may be far from trivial if the collection and recording of information cannot be automated with electronic devices. Moreover, the assumption that the effort spent collecting a vector of data is independent of the features involved may no longer be sustainable. For example, in the agriculture domain a biological test needed to fill in a feature that describes the presence of a particular pest can be very expensive.
The objective of a data acquisition plan is twofold: to increase the chance of obtaining a more accurate model at the next data analysis step and, at the same time, to lower the costs of data acquisition. Note that in this work we assume that the space of features to be collected can change from step to step. This work aims to define a framework for this kind of challenge as a preliminary step towards the development of working solutions. Therefore this paper does not yet provide a solution for arranging successful data acquisition policies.
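
To make the twofold objective concrete, the following toy sketch scores candidate features by an assumed expected accuracy gain minus a weighted acquisition cost. It is only an illustration of the trade-off, not a policy proposed in this paper: the names `Candidate`, `net_utility`, `rank_candidates`, the `tradeoff` weight, the linear scoring rule and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A feature that could be acquired at the next step (hypothetical)."""
    feature: str
    expected_gain: float  # assumed estimate of the accuracy improvement
    cost: float           # assumed acquisition cost, in arbitrary units


def net_utility(candidate: Candidate, tradeoff: float) -> float:
    """Expected accuracy gain penalised by the weighted acquisition cost.

    The linear form is an assumption made for illustration only.
    """
    return candidate.expected_gain - tradeoff * candidate.cost


def rank_candidates(candidates: list[Candidate], tradeoff: float) -> list[Candidate]:
    """Order candidates from most to least attractive under the toy score."""
    return sorted(candidates, key=lambda c: net_utility(c, tradeoff), reverse=True)


# Made-up values: the biological pest test is informative but expensive.
candidates = [
    Candidate("pest_biological_test", expected_gain=0.08, cost=50.0),
    Candidate("leaf_colour", expected_gain=0.03, cost=1.0),
    Candidate("soil_humidity", expected_gain=0.05, cost=5.0),
]

cheap_cost_ranking = rank_candidates(candidates, tradeoff=0.001)
high_cost_ranking = rank_candidates(candidates, tradeoff=0.01)

print([c.feature for c in cheap_cost_ranking])
print([c.feature for c in high_cost_ranking])
```

Raising `tradeoff` pushes the expensive biological test down the ranking, which shows why the weight given to cost relative to accuracy is itself part of the open problem.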