Neural Processing Letters 6: 51–59, 1997.
© 1997 Kluwer Academic Publishers. Printed in the Netherlands.

An Efficient Partition of Training Data Set Improves Speed and Accuracy of Cascade-correlation Algorithm

IGOR V. TETKO 1,2 and ALESSANDRO E.P. VILLA 1
1 Laboratoire de Neuro-heuristique, Institut de Physiologie, UNIL, Rue du Bugnon 7, 1005 Lausanne, Switzerland; 2 Department of Biomedical Applications, Institute of Bioorganic and Petroleum Chemistry, Murmanskaya 1, Kiev–660, 253660, Ukraine
E-mail: alessandro.villa@iphysiol.unil.ch

Key words: algorithm, cascade correlation, early stopping, efficient partition of training data set

Abstract. This study extends the application of the efficient partition algorithm (EPA) to artificial neural network ensembles trained according to the Cascade-Correlation algorithm. We show that EPA makes it possible to decrease the number of cases in the learning and validation data sets. The predictive ability of the ensemble, calculated over the whole data set, is not affected and in some cases is even improved. We also show that the distribution of cases selected by this method is proportional to the second derivative of the analyzed function.

1. Introduction

A learning algorithm based on a combination of early stopping and ensemble averaging (ESE) was recently introduced [1, 2]. This algorithm offers a simple but powerful way to avoid the overfitting/overtraining problem in artificial neural networks (ANNs), so that the prediction ability of the ANNs was better than that of other methods (a minimal sketch of this idea is given at the end of this section). A crucial question remained unsolved: ‘How many cases should be taken for the learning and validation data sets?’ A general answer to this question can hardly be obtained theoretically (e.g., see Amari et al. [3]). This paper addresses the following simpler question: ‘Are all cases from the initial training data set necessary for learning, or can some be discarded without decreasing the prediction ability of an ANN ensemble (ANNE)?’ Reducing the number of cases in the learning and validation data sets can significantly increase the speed of the algorithm, especially in the analysis of large data sets.

The method considered in this paper is based on the analysis of the sensitivity of neural networks to small perturbations of their training data set and to the random initialization of their weights. It is known that even identical learning data sets and basic parameters (e.g., the number of neurons and the connectivity of the network) may yield different results, owing to the sensitivity of ANNs to the initial weights. Ensemble averaging techniques (e.g., bagging [4], ESE [1]) signifi-
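The following is a minimal sketch of the early-stopping-plus-ensemble-averaging idea referred to above. It is not the exact ESE procedure of [1, 2]; all function names, network sizes, and parameter values are illustrative assumptions. Each network receives its own learning/validation split and its own random weight initialization, training halts when the validation error stops improving, and the ensemble prediction is the average over the networks.

```python
# Illustrative sketch of early stopping + ensemble averaging.
# NOT the exact ESE algorithm of [1, 2]; names and settings are assumptions.
import numpy as np

def train_with_early_stopping(x_learn, y_learn, x_val, y_val,
                              n_hidden=10, lr=0.01, max_epochs=5000,
                              patience=50, seed=0):
    rng = np.random.default_rng(seed)
    # Random weight initialization: the source of inter-network variability.
    params = [rng.normal(0, 0.5, (1, n_hidden)), np.zeros(n_hidden),
              rng.normal(0, 0.5, (n_hidden, 1)), np.zeros(1)]

    def forward(x, p):
        h = np.tanh(x @ p[0] + p[1])
        return h, h @ p[2] + p[3]

    best, best_err, wait = None, np.inf, 0
    for epoch in range(max_epochs):
        h, out = forward(x_learn, params)
        err = out - y_learn
        # Gradient descent on mean-squared error for the two-layer net.
        dh = (err @ params[2].T) * (1 - h**2)
        params[0] -= lr * x_learn.T @ dh / len(x_learn)
        params[1] -= lr * dh.mean(0)
        params[2] -= lr * h.T @ err / len(x_learn)
        params[3] -= lr * err.mean(0)
        # Early stopping: monitor the error on the validation set and
        # keep the weights that generalized best so far.
        _, val_out = forward(x_val, params)
        val_err = np.mean((val_out - y_val) ** 2)
        if val_err < best_err:
            best_err, best, wait = val_err, [p.copy() for p in params], 0
        else:
            wait += 1
            if wait > patience:
                break
    return best

def ensemble_predict(x, ensemble):
    # Ensemble averaging: the final output is the mean over all networks.
    preds = [np.tanh(x @ p[0] + p[1]) @ p[2] + p[3] for p in ensemble]
    return np.mean(preds, axis=0)

# Toy usage: five networks, each with its own seed and learn/val split.
rng = np.random.default_rng(42)
x = np.linspace(-1, 1, 200).reshape(-1, 1)
y = np.sin(3 * x) + rng.normal(0, 0.1, x.shape)
ensemble = []
for k in range(5):
    idx = rng.permutation(len(x))
    learn, val = idx[:150], idx[150:]
    ensemble.append(train_with_early_stopping(
        x[learn], y[learn], x[val], y[val], seed=k))
print("ensemble MSE:", np.mean((ensemble_predict(x, ensemble) - y) ** 2))
```

Note that each network in this sketch uses a fixed random learn/validation split; the question the paper addresses is precisely how such splits can be chosen more efficiently, so that fewer cases suffice.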