Progressive Modeling

Wei Fan, Haixun Wang, Philip S. Yu (IBM T.J. Watson Research, Hawthorne, NY 10532, {weifan,haixun,psyu}@us.ibm.com)
Shaw-hwa Lo (Dept. of Statistics, Columbia Univ., New York, NY 10027, slo@stats.columbia.edu)
Salvatore Stolfo (Dept. of Computer Science, Columbia Univ., New York, NY 10027, sal@cs.columbia.edu)

Abstract

Presently, inductive learning is still performed in a frustrating batch process. The user has little interaction with the system and no control over the final accuracy and training time. If the accuracy of the produced model is too low, all the computing resources are misspent. In this paper, we propose a progressive modeling framework. In progressive modeling, the learning algorithm estimates online both the accuracy of the final model and the remaining training time. If the estimated accuracy is far below expectation, the user can terminate training prior to completion without wasting further resources. If the user chooses to complete the learning process, progressive modeling will compute a model with the expected accuracy in the expected time. We describe one implementation of progressive modeling using an ensemble of classifiers.

Keywords: estimation

1 Introduction

Classification is one of the most popular and widely used data mining methods for extracting useful information from databases. ISO/IEC is proposing an international standard, to be finalized in August 2002, that will include four data mining types in database systems: association rules, clustering, regression, and classification. Presently, classification is performed in a "capricious" batch mode, even in many well-known commercial data mining products. An inductive learner is applied to the data; until the model is completely computed and tested, the accuracy of the final model is not known. For many inductive learning algorithms, the actual training time is not known prior to learning either.
It depends not only on the size of the data and the number of features, but also on the combination of feature values that ultimately determines the complexity of the model. During this possibly long waiting period, the only interaction between the user and the program is to make sure that the program is still running and to observe some status reports. If the final accuracy is too low after a long training time, all the computing resources have been wasted. The users either have to repeat the same process with other parameters of the same algorithm, choose a different feature subset, select a completely new algorithm, or give up. There are many learners to choose from, many parameters to select for each learner, countless ways to construct features, and exponentially many ways to perform feature selection. The unpredictable accuracy, long and hard-to-predict training time, and endless ways to run an experiment make data mining frustrating even for experts.

[Figure 1. An interactive scenario where both accuracy and remaining training time are estimated]

1.1 Example of Progressive Modeling

In this paper, we propose a "progressive modeling" concept to address the problems of batch-mode learning. We illustrate the basic ideas through a cost-sensitive example, even though the concept is applicable to both cost-sensitive and traditional accuracy-based problems. We use a charity donation dataset (KDDCup 1998) in which a subset of the population is chosen to receive campaign letters. The cost of a campaign letter is $0.68, so it is only beneficial to send a letter if the solicited person will donate at least $0.68. As soon as learning starts, the framework begins to compute intermediate models and reports the current accuracy, the estimated final accuracy on a hold-out validation set, and the estimated remaining training time. For cost-sensitive problems, accuracy is measured in benefits, such as dollar amounts.
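The benefit-based measure for the donation example can be sketched as follows. This is an illustrative computation, not code from the paper; the names `MAILING_COST` and `total_benefit` are our own.

```python
# Benefit-based accuracy for the charity donation example:
# each letter costs $0.68, and sending one is only profitable
# when the recipient donates more than that cost.
MAILING_COST = 0.68  # dollars per campaign letter

def total_benefit(decisions, donations):
    """Total dollar benefit of a mailing campaign.

    decisions: booleans, True if a letter is sent to that person.
    donations: dollar amount each person would actually donate
               (0.0 for non-donors).
    """
    return sum(d - MAILING_COST
               for send, d in zip(decisions, donations) if send)

# Mailing a $10 donor and a non-donor, skipping a $5 donor:
# (10 - 0.68) + (0 - 0.68) = 8.64 dollars of benefit
benefit = total_benefit([True, True, False], [10.0, 0.0, 5.0])
```

Under this measure a model is evaluated by the total dollars it earns on the validation set, rather than by the fraction of correct class labels.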
We use the term accuracy to mean traditional accuracy and benefits interchangeably where the meaning is clear from the context. Figure 1 shows a snapshot
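The interactive scenario described above can be sketched as a simple training loop over an ensemble of classifiers: after each base model is trained, the framework reports the current validation accuracy and an estimate of the remaining training time. This is a hypothetical sketch under the assumption of equal-cost base models; the function names and the linear time estimate are illustrative, not the paper's implementation.

```python
import time

def progressive_train(make_classifier, data_chunks, validate):
    """Train base classifiers one at a time, reporting progress.

    make_classifier: trains and returns one base model from a data chunk.
    data_chunks:     list of training-data partitions.
    validate:        maps the current ensemble to an accuracy/benefit score
                     on a hold-out validation set.
    """
    total = len(data_chunks)
    ensemble, start = [], time.time()
    for k, chunk in enumerate(data_chunks, 1):
        ensemble.append(make_classifier(chunk))   # train one base model
        current = validate(ensemble)              # accuracy/benefit so far
        elapsed = time.time() - start
        remaining = elapsed / k * (total - k)     # naive linear estimate
        print(f"model {k}/{total}: current={current:.3f}, "
              f"est. remaining={remaining:.1f}s")
        # The user could inspect 'current' here and abort early if the
        # estimated final accuracy is far below expectation.
    return ensemble
```

The key property is that useful information (current accuracy, remaining time) is available after every base model, so the decision to continue or terminate can be made at any point rather than only after batch training completes.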