Selected Prior Research
Robert L. Grossman
March, 2006

Data Mining Systems

1996 - scaled tree-based classifiers to very large data sets. A fundamental challenge in data mining is to mine data sets that are so large that they do not fit into a computer's memory. This is important for a wide variety of applications, ranging from homeland defense to identifying fraudulent credit card transactions. Tree-based classifiers and predictors are among the most accurate techniques in data mining. Our 1996 paper [16] described a method for computing tree-based classifiers on data sets that are too large to fit into a computer's memory. The first idea is to partition the data, build individual trees on each partition, and then combine the trees into an ensemble, or collection, of classifiers. The second idea is to use stratified sampling to oversample rare events and distribute them over the various partitions; this is essentially a variant of a type of sampling called bootstrapping. This technique was implemented in Magnify's 1996 version of the PATTERN data mining system, where it was called Averaged Classification Trees/Averaged Regression Trees (ACT/ART). PATTERN was the first data mining system to build very accurate classifiers on data sets that could not fit into a computer's memory, allowing classifiers to be built in 1996 on terabyte-size data sets at a time when memory was measured in megabytes and disks in gigabytes. The 1996 paper by Breiman [1] presented a complementary idea called bagging, in which ensembles of trees are built over small data sets by repeated sampling with replacement (another variant of bootstrapping). Building ensembles of trees via partitioning and appropriate bootstrapping is still considered by many to be the most effective algorithm for detecting rare events in large data sets.

1997 - decreased the time and cost to deploy new data mining models.
Although companies began quite a few data mining projects in the 1990s, many were not as successful as anticipated. One reason for this lack of success is that, although a great deal of time and energy was spent building statistical and data mining models, it was often very difficult to deploy those models in operational systems and to update them. In 1995-1997, I worked with members of the Terabyte Challenge Testbed to introduce what are now called scoring engines. The basic idea is to
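The partition-and-oversample method described under 1996 above can be illustrated with a short sketch. This is not the PATTERN/ACT-ART code; it is a minimal stand-in in which one-level decision "stumps" replace full classification trees, and all names and parameters are invented for the example. It shows the two ideas from the 1996 paper: partition the common events and build one tree per partition, and give every partition a bootstrap sample (drawn with replacement) of the rare events so that no tree misses them.

```python
import random
from collections import Counter

def train_stump(rows):
    """Fit a one-split 'tree': threshold each feature at its mean and keep
    the split with the fewest misclassified rows. rows = [(features, label), ...]."""
    best = None  # (error, feature, threshold, label_above, label_below)
    for f in range(len(rows[0][0])):
        t = sum(x[f] for x, _ in rows) / len(rows)
        above = [y for x, y in rows if x[f] > t]
        below = [y for x, y in rows if x[f] <= t]
        la = Counter(above).most_common(1)[0][0] if above else 0
        lb = Counter(below).most_common(1)[0][0] if below else 0
        err = sum((la if x[f] > t else lb) != y for x, y in rows)
        if best is None or err < best[0]:
            best = (err, f, t, la, lb)
    _, f, t, la, lb = best
    return lambda x: la if x[f] > t else lb

def partitioned_ensemble(rows, rare_label, n_parts=4, seed=0):
    """Partition the common events across n_parts; add a bootstrap sample of
    the rare events to EVERY partition; majority-vote the resulting trees."""
    rng = random.Random(seed)
    rare = [r for r in rows if r[1] == rare_label]
    common = [r for r in rows if r[1] != rare_label]
    rng.shuffle(common)
    parts = [common[i::n_parts] for i in range(n_parts)]
    for p in parts:  # stratified oversampling: rare events reach every tree
        p.extend(rng.choices(rare, k=2 * len(rare)))
    trees = [train_stump(p) for p in parts]
    return lambda x: Counter(t(x) for t in trees).most_common(1)[0][0]

# Toy data: 200 common events with feature values near 0-1,
# 10 rare events (label 1) with feature values near 2-3.
gen = random.Random(1)
data = ([([gen.uniform(0.0, 1.0)], 0) for _ in range(200)] +
        [([gen.uniform(2.0, 3.0)], 1) for _ in range(10)])
model = partitioned_ensemble(data, rare_label=1)
print(model([2.5]), model([0.5]))  # → 1 0
```

Because every partition sees a bootstrap copy of the rare class, each tree learns a split that separates the rare events, even though the rare class is only 5% of the full data set.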
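The text breaks off here, but the scoring-engine pattern as it is generally understood separates model building from model deployment: the modeling environment exports the fitted model as a declarative description, and a lightweight engine embedded in the operational system reads that description and scores incoming records. The following is a minimal sketch, not the Terabyte Challenge implementation; the JSON schema and the fraud-detection rule are invented for the example (real deployments later standardized such model descriptions as PMML).

```python
import json

# Hypothetical exported model: a small fraud-detection tree serialized as
# JSON by the modeling environment. The schema is invented for this sketch.
MODEL_JSON = """
{ "tree": { "feature": "amount", "threshold": 500.0,
            "left":  { "label": "ok" },
            "right": { "feature": "hour", "threshold": 4.0,
                       "left":  { "label": "fraud" },
                       "right": { "label": "ok" } } } }
"""

def score(node, record):
    """Scoring engine: walk the exported tree for one record.
    Values <= threshold go left, values > threshold go right."""
    while "label" not in node:
        branch = "left" if record[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

model = json.loads(MODEL_JSON)
# A large transaction at 3 a.m. versus a small daytime one.
print(score(model["tree"], {"amount": 900.0, "hour": 3.0}))   # → fraud
print(score(model["tree"], {"amount": 120.0, "hour": 14.0}))  # → ok
```

The point of the pattern is that updating the deployed model is reduced to shipping a new description file; no code in the operational system changes.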