W3C.3 Determining the Orders of Feature and Hidden Unit Prunings of Artificial Neural Networks

Kietikul JEARANAITANAKIJ
Department of Computer Engineering, Faculty of Engineering
King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand
kjkietik@kmitl.ac.th

Ouen PINNGERN
Department of Computer Engineering, Faculty of Engineering
King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand
kpouen@kmitl.ac.th

Abstract- A great deal of research has been undertaken on pruning features and hidden units in order to reduce the size of Artificial Neural Networks (ANNs). However, none of these methods addresses the relationship between the pruned unit and the number of epochs needed to retrain the network once the unit is removed. In this paper, we present two heuristics for determining pruning orders that lead to a near-minimal number of retraining epochs. The heuristics are based on a modified information gain calculated from all features in the training data. We then test the proposed heuristics on an exclusive-or data set. The experimental results show the success of using information gain as a criterion for determining pruning orders.

Keywords- Artificial Neural Networks, pruning order, feature pruning, hidden unit pruning, information gain

I. INTRODUCTION

An artificial neural network can be defined as a model of reasoning based on the human brain. Among ANN models, Backpropagation (BP), developed by Rumelhart et al. [1] in 1986, is the most widely used method for training feed-forward neural networks. However, the typical BP method trains ANNs without any reduction in size, so the resulting network is sometimes too bulky. To produce a compact network, two quantities need to be minimized: the number of features of the data set and the number of hidden units. In addition, removing features and hidden units in the wrong order may lengthen the retraining time. We therefore focus on minimizing the retraining time.

A large number of investigations have been undertaken to reduce the size of ANNs. Belue and Bauer [2] reported several saliency measures for selecting the feature to be removed from the network. Setiono and Liu [3] proposed the network's accuracy on the training data set, obtained by adding a penalty term to the error function of the network, as a criterion for feature removal. Mozer and Smolensky [4] described a Skeletonization method that estimates which unit is the least important, according to the smallest effect on the training error, and deletes it during training. Sietsma and Dow [5], [6] suggested an interactive method in which they inspect a trained network and identify a hidden unit that has a constant activation over all training patterns; a hidden unit that does not influence the output is then pruned away. Murase et al. [7] measured the Goodness Factors of the hidden units in the trained network; the unit with the lowest Goodness Factor is removed from the hidden layer. Hagiwara [8] presented the Consuming Energy and the Weights Power methods for removing hidden units and weights, respectively. Among these methods for reducing the size of ANNs, none addresses the order of unit pruning that can lead to a near-minimal retraining time.

In this paper, we propose two heuristics for determining the pruning orders of features and hidden units in ANNs. These heuristics employ a power of information gain as the criterion for pruning orders. We perform experiments on some variations of the exclusive-or problem. The results show that the proposed heuristics prune away unnecessary features and hidden units in ANNs, yielding a remarkable reduction in retraining time.

The rest of this paper is organized as follows. In Section 2, we explain a modified version of information gain for classifying a continuous data set.
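As an illustration of the constant-activation criterion of Sietsma and Dow described above, the following sketch (our own illustration, not code from any of the cited papers) flags hidden units whose activation variance over the training patterns is near zero; the function name and the tolerance threshold are illustrative assumptions.

```python
import numpy as np

def constant_activation_units(hidden_activations, tol=1e-3):
    """Return indices of hidden units whose activation is (nearly)
    constant over all training patterns, making them candidates
    for pruning under Sietsma and Dow's criterion.

    hidden_activations: array of shape (n_patterns, n_hidden),
    the activation of each hidden unit for each training pattern.
    tol: variance threshold below which a unit is treated as
    constant (an illustrative choice, not taken from the paper).
    """
    variances = np.var(hidden_activations, axis=0)
    return [j for j, v in enumerate(variances) if v < tol]

# Example: unit 1 outputs 0.5 for every pattern, so it is flagged.
acts = np.array([[0.1, 0.5, 0.9],
                 [0.8, 0.5, 0.2],
                 [0.3, 0.5, 0.7]])
print(constant_activation_units(acts))  # -> [1]
```

A unit flagged this way contributes only a constant offset to the next layer, which can be folded into the biases, so pruning it leaves the network's function essentially unchanged.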
In Section 3, we employ information gain to prune away features and hidden units. Next, in Section 4, we describe the experimental study, the data sets used, and the experimental results. Finally, in Section 5, we summarize our findings and suggest possible directions for future investigations.

II. MODIFICATION OF INFORMATION GAIN

We begin by defining a modification of information gain for classifying a continuous data set. Entropy, a measure commonly used in information theory, characterizes the (im)purity of an arbitrary collection of examples. Given a collection S containing examples of each of the C outcomes, the entropy of S is

Entropy(S) = \sum_{I \in C} -p(I) \log_2 p(I),  (1)

where p(I) is the proportion of S belonging to class I. Note that S is not a feature but an entire sample set. Entropy is 0 if all members of S belong to the same class. The entropy ranges from 0 (purity) to 1 (impurity). The next measure is information gain. This was first defined by Shannon and

0-7803-9282-5/05/$20.00 ©2005 IEEE  ICICS 2005
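The entropy of Eq. (1) can be computed directly from the class proportions of a sample set. A minimal sketch (our own illustration; the function name is an assumption, not from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels, per Eq. (1):
    Entropy(S) = sum over classes I of -p(I) * log2(p(I)),
    where p(I) is the proportion of samples in class I."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

# A pure collection has entropy 0 (purity); a balanced
# two-class collection has entropy 1 (impurity).
print(entropy([1, 1, 1, 1]))  # -> 0.0
print(entropy([0, 0, 1, 1]))  # -> 1.0
```

Note that classes absent from the collection contribute nothing to the sum, which sidesteps the undefined term 0 * log2(0).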