New Entropy Learning Method for Neural Network

Khue Hiang Chan, Geok See Ng, Sevki S. Erdogan and Harcharan Singh
School of Applied Science, Nanyang Technological University, Singapore
pchankh@hotmail.com, asgsng@ntu.edu.sg, asserdogan@ntu.edu.sg, ashsingh@ntu.edu.sg

ABSTRACT

In this paper, an entropy penalty term is used to steer the direction of the hidden node’s activation in the process of learning. A state with minimum entropy means that nodes are operating near the extreme values of the Sigmoid curve. As the training proceeds, redundant hidden nodes’ activations are pushed towards their extreme value, while relevant nodes remain active in the linear region of the Sigmoid curve. The early creation of redundant nodes may impair generalisation. To prevent the network from being driven into saturation before it can really learn, an entropy cycle is proposed to dampen the early creation of such redundant nodes.

1. INTRODUCTION

The mapping capability of a neural network depends on its structure, that is, the number of layers and hidden nodes. A network that has a structure simpler than necessary cannot give good approximations to the training patterns. A network with more layers and hidden nodes can perform more complicated mappings. However, better performance on unseen data, that is, generalisation ability, implies lower order mapping. Bigger networks also need larger data samples for training. With a smaller network size, less memory is required to store the connection weights and the computational cost of each iteration decreases.

In this paper, a total cost function, which includes an entropy penalty term with a novel relaxing cycle called entropy cycle, is proposed during the learning process of the neural networks. At the end of entropy learning, inactive nodes created can be eliminated without affecting the performance of the original network.

2. THE ENTROPY LEARNING METHOD

Suppose that an entropy function can be defined with respect to the activity $z_j$ of the hidden node $j$. If the entropy is minimised, only a certain subset of the hidden nodes will be turned on. On the other hand, if the entropy is maximised, all the hidden nodes are nearly equally activated in their linear zone. $z_j$, which is the activation of a hidden node $j$, is obtained by

$$z_j = f(I_j) = \frac{1}{1 + e^{-I_j}} \tag{1}$$

where $f$ is the Sigmoid function and $I_j$ is defined by

$$I_j = \sum_{i=0}^{N_I - 1} v_{ji} x_i \tag{2}$$

where $x_i$ is from input node $i$, $v_{ji}$ is the weight connection from input node $i$ to hidden node $j$ and $N_I$ is the number of input nodes. The activation $z_j$ is normalised as follows:

$$p_j = \frac{z_j}{\sum_{i=0}^{N_H - 1} z_i} \tag{3}$$

where $N_H$ is the number of hidden nodes. By using this normalised activation, an entropy function can be formulated by

$$H = -\sum_{j=0}^{N_H - 1} p_j \log p_j. \tag{4}$$

By minimising the entropy term, the hidden nodes’ activation can be forced to develop extreme representation near 0 and 1. However, the creation of such nodes in the early stage of network learning may impair generalisation performance. To prevent the network from being driven into saturation before it starts to learn, the