Selected Prior Research
Robert L. Grossman
March, 2006

Data Mining Systems

1996 - scaled tree-based classifiers to very large data sets. A fundamental challenge in data mining is to mine data sets that are so large that they do not fit into a computer's memory. This is important for a wide variety of applications, ranging from homeland defense to identifying fraudulent credit card transactions. Tree-based classifiers and predictors are among the most accurate techniques in data mining. Our 1996 paper [16] described a method for computing tree-based classifiers on data sets that are too large to fit into a computer's memory. The first idea is to partition the data, build individual trees on each partition, and then combine the trees into an ensemble, or collection, of classifiers. The second idea is to use stratified sampling to oversample rare events and distribute them over the various partitions; this is essentially a variant of a type of sampling called bootstrapping. This technique was implemented in Magnify's 1996 version of the PATTERN data mining system, where it was called Averaged Classification Trees/Averaged Regression Trees (ACT/ART). PATTERN was the first data mining system to build very accurate classifiers on data sets that could not fit into a computer's memory, allowing classifiers to be built in 1996 on terabyte-size data sets at a time when memory was measured in megabytes and disks in gigabytes. The 1996 paper by Breiman [1] presented a complementary idea called bagging, in which ensembles of trees are built over small data sets by repeated sampling with replacement (another variant of bootstrapping). Building ensembles of trees via partitioning and appropriate bootstrapping is still considered by many to be the most effective algorithm for detecting rare events in large data sets.

1997 - decreased the time and cost to deploy new data mining models.
Although companies began quite a few data mining projects in the 1990s, many were not as successful as anticipated. One reason for this lack of success is that, although a great deal of time and energy was spent building statistical and data mining models, it was often very difficult to deploy those models in operational systems and to update them. In 1995-1997, I worked with members of the Terabyte Challenge Testbed to introduce what are now called scoring engines. The basic idea is to
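The partition-and-oversample method described under 1996 above can be illustrated with a short sketch. This is not the PATTERN/ACT-ART code; it is a minimal stand-in in which one-level decision "stumps" replace full classification trees, and all names and parameters are invented for the example. It shows the two ideas from the 1996 paper: partition the common events and build one tree per partition, and give every partition a bootstrap sample (drawn with replacement) of the rare events so that no tree misses them.

```python
import random
from collections import Counter

def train_stump(rows):
    """Fit a one-split 'tree': threshold each feature at its mean and keep
    the split with the fewest misclassified rows. rows = [(features, label), ...]."""
    best = None  # (error, feature, threshold, label_above, label_below)
    for f in range(len(rows[0][0])):
        t = sum(x[f] for x, _ in rows) / len(rows)
        above = [y for x, y in rows if x[f] > t]
        below = [y for x, y in rows if x[f] <= t]
        la = Counter(above).most_common(1)[0][0] if above else 0
        lb = Counter(below).most_common(1)[0][0] if below else 0
        err = sum((la if x[f] > t else lb) != y for x, y in rows)
        if best is None or err < best[0]:
            best = (err, f, t, la, lb)
    _, f, t, la, lb = best
    return lambda x: la if x[f] > t else lb

def partitioned_ensemble(rows, rare_label, n_parts=4, seed=0):
    """Partition the common events across n_parts; add a bootstrap sample of
    the rare events to EVERY partition; majority-vote the resulting trees."""
    rng = random.Random(seed)
    rare = [r for r in rows if r[1] == rare_label]
    common = [r for r in rows if r[1] != rare_label]
    rng.shuffle(common)
    parts = [common[i::n_parts] for i in range(n_parts)]
    for p in parts:  # stratified oversampling: rare events reach every tree
        p.extend(rng.choices(rare, k=2 * len(rare)))
    trees = [train_stump(p) for p in parts]
    return lambda x: Counter(t(x) for t in trees).most_common(1)[0][0]

# Toy data: 200 common events with feature values near 0-1,
# 10 rare events (label 1) with feature values near 2-3.
gen = random.Random(1)
data = ([([gen.uniform(0.0, 1.0)], 0) for _ in range(200)] +
        [([gen.uniform(2.0, 3.0)], 1) for _ in range(10)])
model = partitioned_ensemble(data, rare_label=1)
print(model([2.5]), model([0.5]))  # → 1 0
```

Because every partition sees a bootstrap copy of the rare class, each tree learns a split that separates the rare events, even though the rare class is only 5% of the full data set.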
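The text breaks off here, but the scoring-engine pattern as it is generally understood separates model building from model deployment: the modeling environment exports the fitted model as a declarative description, and a lightweight engine embedded in the operational system reads that description and scores incoming records. The following is a minimal sketch, not the Terabyte Challenge implementation; the JSON schema and the fraud-detection rule are invented for the example (real deployments later standardized such model descriptions as PMML).

```python
import json

# Hypothetical exported model: a small fraud-detection tree serialized as
# JSON by the modeling environment. The schema is invented for this sketch.
MODEL_JSON = """
{ "tree": { "feature": "amount", "threshold": 500.0,
            "left":  { "label": "ok" },
            "right": { "feature": "hour", "threshold": 4.0,
                       "left":  { "label": "fraud" },
                       "right": { "label": "ok" } } } }
"""

def score(node, record):
    """Scoring engine: walk the exported tree for one record.
    Values <= threshold go left, values > threshold go right."""
    while "label" not in node:
        branch = "left" if record[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

model = json.loads(MODEL_JSON)
# A large transaction at 3 a.m. versus a small daytime one.
print(score(model["tree"], {"amount": 900.0, "hour": 3.0}))   # → fraud
print(score(model["tree"], {"amount": 120.0, "hour": 14.0}))  # → ok
```

The point of the pattern is that updating the deployed model is reduced to shipping a new description file; no code in the operational system changes.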