Data Mining and Knowledge Discovery, 8, 97–126, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Efficient Multisplitting Revisited: Optima-Preserving Elimination of Partition Candidates

TAPIO ELOMAA  elomaa@cs.helsinki.fi
JUHO ROUSU  rousu@cs.helsinki.fi
Department of Computer Science, P.O. Box 26, FIN-00014, University of Helsinki, Finland

Editors: Fayyad, Mannila, Ramakrishnan

Received January 30, 2001; Revised November 6, 2002

Abstract. We consider multisplitting of numerical value ranges, a task that is encountered as a discretization step preceding induction and also embedded into learning algorithms. We are interested in finding the partition that optimizes the value of a given attribute evaluation function. For most commonly used evaluation functions this task takes time quadratic in the number of potential cut points in the numerical range. Hence, it is a potential bottleneck in data mining algorithms. We present two techniques that speed up the optimal multisplitting task. The first discards cut point candidates in a quick linear-time preprocessing scan before embarking on the actual search. We generalize the definition of boundary points by Fayyad and Irani so as to merge adjacent example blocks that have the same relative class distribution. We prove for several commonly used evaluation functions that this preprocessing removes only suboptimal cut points; hence, the algorithm does not lose optimality. Our second technique tackles the quadratic-time dynamic programming algorithm, which is the best known scheme for optimizing many well-known evaluation functions. We present a technique that dynamically—i.e., during the search—prunes partitions of prefixes of the sorted data from the search space of the algorithm. The method works for all convex and cumulative evaluation functions. Together, these two techniques speed up the multisplitting process considerably.
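The block-merging preprocessing summarized above can be illustrated with a short sketch. This is not the authors' implementation; the block representation (a dict of class counts per run of equal attribute values) is our assumption, chosen only to show how adjacent blocks with identical relative class distributions collapse into one, eliminating the cut points between them:

```python
from fractions import Fraction

def merge_blocks(blocks):
    """Merge adjacent example blocks whose relative class distributions
    coincide; cut points inside a merged run need not be searched.
    Each block is a dict mapping a class label to an example count."""
    def distribution(counts):
        total = sum(counts.values())
        # Exact rational arithmetic avoids floating-point equality pitfalls.
        return {c: Fraction(n, total) for c, n in counts.items()}

    merged = []
    for counts in blocks:
        if merged and distribution(merged[-1]) == distribution(counts):
            # Same relative class distribution: absorb into previous block.
            for c, n in counts.items():
                merged[-1][c] = merged[-1].get(c, 0) + n
        else:
            merged.append(dict(counts))
    return merged
```

For example, two adjacent blocks with class counts {a: 1, b: 1} and {a: 2, b: 2} share the distribution (1/2, 1/2) and merge into a single block {a: 3, b: 3}.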
Compared to the baseline dynamic programming algorithm, the speed-up is around 50 percent on average and up to 90 percent in some cases. We conclude that optimal multisplitting is fully feasible on all benchmark data sets we have encountered.

Keywords: numerical attributes, optimal partitions, convex functions, boundary points

1. Introduction

Numerical data is frequently encountered in data mining tasks. In most cases, the learning algorithms used require splitting the numerical domains into two or more intervals (Fayyad and Irani, 1992, 1993; Ching et al., 1995; Dougherty et al., 1995; Fulton et al., 1995; Kohavi and Sahami, 1996; Liu and Setiono, 1997; Hong, 1997; Cerquides and López de Màntaras, 1997; Ho and Scott, 1997). Depending on the method, this task is encountered either in preprocessing, where the numerical domain is "discretized," or embedded within, for example, a decision tree learning algorithm. Numerical attribute handling is critical in inductive learning. For instance, the quadratic time complexity of the popular C4.5 decision tree learning algorithm (Quinlan, 1993) for