662 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 9, NO. 4, JULY 1998

Simulated Annealing and Weight Decay in Adaptive Learning: The SARPROP Algorithm

Nicholas K. Treadgold and Tamas D. Gedeon

Abstract—A problem with gradient descent algorithms is that they can converge to poorly performing local minima. Global optimization algorithms address this problem, but at the cost of greatly increased training times. This work examines combining gradient descent with the global optimization technique of simulated annealing (SA). Simulated annealing in the form of noise and weight decay is added to resilient backpropagation (RPROP), a powerful gradient descent algorithm for training feedforward neural networks. The resulting algorithm, SARPROP, is shown through various simulations not only to escape local minima, but also to maintain, and often improve, the training times of the RPROP algorithm. In addition, SARPROP may be used with a restart training phase which allows a more thorough search of the error surface and provides an automatic annealing schedule.

Index Terms—Adaptive, backpropagation algorithm, gradient descent, neural network, RPROP, simulated annealing, weight decay.

I. INTRODUCTION

THERE ARE two traditional methods for training feedforward neural networks: gradient descent and global optimization. The most commonly used method is gradient descent, which includes algorithms such as backpropagation [1], conjugate gradient methods [2], and the Levenberg–Marquardt algorithm [3]. Conjugate gradient methods and algorithms using the Hessian, such as the Levenberg–Marquardt algorithm, generally converge to minima more rapidly than methods based only on steepest descent, such as backpropagation [2], [3]. Backpropagation, however, has the advantage of being less computationally expensive for a given size network, a factor which becomes much more important for larger networks.
There are a number of algorithms which improve on backpropagation's convergence properties while maintaining its computational simplicity [4]–[8]. One problem inherent with gradient descent methods is their convergence to local minima. While some local minima can provide solutions which are acceptable, they often result in poor performance. This problem can be overcome through the use of global optimization. In the field of neural networks, some global optimization algorithms which have been employed include simulated annealing (SA) [9]–[11], evolutionary methods [12], [13], random methods [14], [15], and deterministic searches [16]. Global optimization, however, has the problem of being computationally expensive, particularly for large networks.

Both gradient descent and global optimization methods have inherent problems: gradient descent can converge to poorly performing local minima, and global optimization is computationally expensive. In order to overcome these problems, hybrid methods employing gradient descent and some form of global optimization have been examined [17]–[19]. This combination of techniques often results in improved convergence times compared to standard global optimization, and in some cases even maintains the guarantee of convergence to a global minimum [17], [19].

In this paper, we look at combining a quick and computationally cheap gradient descent algorithm, resilient backpropagation (RPROP) [8], [20], with the global search technique of SA. The aim of this combination of techniques is to maintain quick convergence using a computationally cheap algorithm, while reducing the likelihood of convergence to poor local minima.

Manuscript received December 17, 1996; revised March 15, 1998. The authors are with the Department of Information Engineering, School of Computer Science and Engineering, The University of New South Wales, Sydney N.S.W. 2052, Australia. Publisher Item Identifier S 1045-9227(98)04454-3.
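The hybrid idea of this paragraph can be sketched in a few lines: take ordinary gradient descent steps, but add perturbation noise whose magnitude is annealed toward zero, so that early iterations can escape shallow local minima while late iterations reduce to pure descent. This is a generic illustration of the gradient-descent-plus-SA combination, not the authors' SARPROP update rule; the function and parameter names are hypothetical.

```python
import numpy as np

def noisy_descent(grad_fn, w, lr=0.05, temp0=1.0, decay=0.99, iters=500, seed=0):
    """Gradient descent with annealed Gaussian noise (illustrative sketch).

    grad_fn: gradient of the objective at w
    temp0, decay: initial 'temperature' and its geometric cooling factor
    """
    rng = np.random.default_rng(seed)
    temp = temp0
    for _ in range(iters):
        # Noise magnitude shrinks as the temperature anneals toward zero,
        # so late iterations behave like plain gradient descent.
        w = w - lr * grad_fn(w) + rng.normal(0.0, temp, size=np.shape(w))
        temp *= decay
    return w

# Example: descend a simple quadratic; the annealed noise dies away and
# the iterate settles near the minimum at w = 0.
w_final = noisy_descent(lambda w: 2.0 * w, 5.0)
```

The cooling schedule (here a fixed geometric decay) is the key design choice: cool too fast and the method degenerates to plain gradient descent; cool too slowly and training time grows toward that of full SA.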
This paper is organized as follows. First, the RPROP algorithm is described and the reasons for its choice as the gradient descent method are discussed. Next, the SA enhancements made to RPROP to obtain the SARPROP algorithm are given. The benefits of using SARPROP with a restart training phase are also discussed through the introduction of the ReSARPROP algorithm. The results of comparative simulations between RPROP, SARPROP, and ReSARPROP are then presented and discussed, and conclusions drawn.

II. RESILIENT BACKPROPAGATION

There have been a number of refinements made to the backpropagation (BP) algorithm. One of the most successful in terms of convergence is RPROP [8], [20]. Not only is RPROP one of the faster converging BP variants, it also has the important advantage of having only a single user-set parameter. In addition, this single parameter is relatively invariant to its initial value since it is adapted quickly by RPROP to suit the problem [8]. This is a great advantage over many other BP variants which have a number of parameters whose settings often greatly influence the performance of the algorithm on a given problem [8]. A further advantage of RPROP is that it maintains the computational simplicity of BP. There are two major differences between BP and RPROP. First, RPROP modifies the size of the weight step taken adaptively, and second, the mechanism for adaptation in RPROP does not take into account the magnitude of the gradient as seen by a particular weight, but only its sign.
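The sign-based adaptation just described can be sketched as follows: each weight carries its own step size, which grows when the gradient keeps its sign across iterations and shrinks when the sign flips. This is a simplified sketch of the RPROP family of update rules (it omits the weight-backtracking variants); the constants shown (1.2, 0.5, and the step bounds) are the commonly cited defaults, and the function name is hypothetical.

```python
import numpy as np

def rprop_step(grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One RPROP-style update: adapt each weight's step size from the
    sign of its gradient, ignoring the gradient's magnitude."""
    sign_change = grads * prev_grads
    # Same sign as the previous iteration: accelerate (cap at step_max).
    steps = np.where(sign_change > 0,
                     np.minimum(steps * eta_plus, step_max), steps)
    # Sign flipped (a minimum was overshot): back off (floor at step_min).
    steps = np.where(sign_change < 0,
                     np.maximum(steps * eta_minus, step_min), steps)
    # The weight change uses only the gradient's sign, not its magnitude.
    delta_w = -np.sign(grads) * steps
    return delta_w, steps

# Example: minimize f(w) = sum(w**2) with per-weight adaptive steps.
w = np.array([5.0, -3.0])
steps = np.full_like(w, 0.1)
prev_grads = np.zeros_like(w)
for _ in range(200):
    grads = 2.0 * w                    # gradient of f
    delta_w, steps = rprop_step(grads, prev_grads, steps)
    w += delta_w
    prev_grads = grads
```

Because only the sign drives the update, RPROP is insensitive to the wide variation in gradient magnitudes across layers, which is one reason for its fast convergence relative to plain BP.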