4th Neural Coding Workshop, Plymouth, UK, 10-15 September 2001

Noise, Pseudopatterns, and Information Transfer in the Brain

Robert M. French
Quantitative Psychology & Cognitive Science
Psychology Department
University of Liège, 4000 Liège, Belgium
rfrench@ulg.ac.be

Nick Chater
Institute for Applied Cognitive Science
Department of Psychology
University of Warwick
Coventry, CV4 7AL, UK
nick.chater@warwick.ac.uk

1. Overview

Connectionist learning methods for distributed feedforward networks typically produce "catastrophic forgetting": as new information is learned, old information is rapidly obliterated. This contrasts with the much more gradual forgetting observed in human memory, casting doubt on the cognitive plausibility of connectionist learning methods. It has previously been shown that paired neural networks that are "cross-trained" in a rather intricate manner can address this problem (French, 1997; Ans & Rousset, 1997). Here we show that a simpler, single-network approach, which makes use only of noise passed through the network, can also significantly reduce catastrophic interference. We speculate that a mechanism of this kind might be involved in human learning.

2. The Hessian pseudopattern backpropagation algorithm

When a neural network learns (perfectly) a set of patterns $\{P_i : I_i \to O_i\}_{i=1}^{N}$, this defines a unique error surface $E(w)$ with respect to the weights of the network. Learning the set of patterns means that the network has found a local minimum $w_0$ in weight space for which $E(w_0) = 0$ and $E'(w_0) = 0$, where $E'(w)$ denotes the first derivative of the error function. The problem is that when new patterns are subsequently learned by the network, a new error surface is created. If the previously learned patterns have not been interleaved with the new patterns, the point in weight space corresponding to a minimum of the new patterns' error surface may not correspond at all to an error minimum for the previously learned patterns; hence the catastrophic forgetting of the old patterns.

Now assume that the original patterns $\{P_i\}_{i=1}^{N}$ are no longer available for interleaving with the new patterns, but that we would nonetheless like to approximate the original error surface $E(w)$. If the function $f$ underlying the original set of patterns is relatively "nice" (i.e., continuous, reasonably smooth, etc.), then by generating a set of pseudopatterns $\{\psi_i : I_i \to O_i\}_{i=1}^{M}$, whose input values are drawn from a random distribution and whose associated outputs are simply the result of passing those random inputs through the network, we can produce a reasonable approximation of $f$ (Robins, 1995). Just as the original set of patterns $\{P_i\}_{i=1}^{N}$ had a unique error surface associated with it, so does the set of pseudopatterns $\{\psi_i\}_{i=1}^{M}$. We will call this latter error surface $\hat{E}(w)$. It follows from the definition of pseudopatterns that $\hat{E}(w_0) = 0$ and $\hat{E}'(w_0) = 0$: each pseudopattern's target is, by construction, exactly the network's own output at $w_0$, so the network at $w_0$ fits the pseudopatterns perfectly.

The question is how to develop an approximation of this error surface in the vicinity of $w_0$ without having recourse to the original patterns. We develop a Taylor series expansion of $\hat{E}(w)$, which requires the second derivative of $\hat{E}(w)$ (i.e., the Hessian); unlike the first derivative, the Hessian does not vanish when evaluated at $w_0$. Once we have the second derivative, we immediately obtain $\hat{E}(w)$, the desired approximation of the original error surface.
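Concretely, since both the zeroth- and first-order terms vanish at $w_0$, the expansion reduces to its quadratic term. The display below spells this out; it is a reconstruction from the definitions above, and the symbol $H$ for the Hessian is our shorthand:

```latex
\hat{E}(w) \;=\; \underbrace{\hat{E}(w_0)}_{=\,0}
          \;+\; \underbrace{\hat{E}'(w_0)^{\top}(w - w_0)}_{=\,0}
          \;+\; \tfrac{1}{2}\,(w - w_0)^{\top} H \,(w - w_0)
          \;+\; O\!\bigl(\lVert w - w_0 \rVert^{3}\bigr),
\qquad H \equiv \hat{E}''(w_0),
```

so that near $w_0$ we have $\hat{E}(w) \approx \frac{1}{2}(w - w_0)^{\top} H (w - w_0)$. One natural use of this quadratic form is as a penalty added to the error on the new patterns, steering learning toward weights at which the old error surface remains low.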
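The pseudopattern mechanism itself is straightforward to implement. The sketch below is a minimal illustration only, not the simulation reported here: the toy backpropagation network, layer sizes, learning parameters, and data are all hypothetical. It trains a small network on a set of "old" patterns, samples pseudopatterns by passing random binary inputs through the trained network at $w_0$, and then interleaves those pseudopatterns with the new patterns.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MLP:
    """Minimal one-hidden-layer feedforward net trained with plain backprop."""
    def __init__(self, n_in, n_hid, n_out):
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hid))
        self.W2 = rng.normal(0.0, 0.5, (n_hid, n_out))

    def forward(self, x):
        self.h = sigmoid(x @ self.W1)
        return sigmoid(self.h @ self.W2)

    def train(self, X, Y, epochs=2000, lr=0.5):
        for _ in range(epochs):
            o = self.forward(X)
            d_o = (o - Y) * o * (1.0 - o)                  # output-layer delta (squared error)
            d_h = (d_o @ self.W2.T) * self.h * (1.0 - self.h)
            self.W2 -= lr * self.h.T @ d_o
            self.W1 -= lr * X.T @ d_h

def make_pseudopatterns(net, n_in, m):
    """Random binary inputs; targets are the network's own outputs at w0."""
    I = rng.integers(0, 2, (m, n_in)).astype(float)
    return I, net.forward(I)

# "Old" patterns (hypothetical toy data), learned first.
X_old = rng.integers(0, 2, (4, 8)).astype(float)
Y_old = rng.integers(0, 2, (4, 4)).astype(float)
net = MLP(8, 16, 4)
net.train(X_old, Y_old)

# Sample pseudopatterns at w0, before new learning disturbs the weights.
I_ps, O_ps = make_pseudopatterns(net, 8, 32)

# New patterns are interleaved with the pseudopatterns rather than with
# the (now unavailable) original patterns.
X_new = rng.integers(0, 2, (4, 8)).astype(float)
Y_new = rng.integers(0, 2, (4, 4)).astype(float)
net.train(np.vstack([X_new, I_ps]), np.vstack([Y_new, O_ps]))

# Forgetting of the old patterns should be far less severe than it would
# have been had the new patterns been trained on alone.
print("old-pattern error:", np.mean((net.forward(X_old) - Y_old) ** 2))
```

Because the pseudopattern targets are generated at $w_0$, interleaving them during new learning penalizes departures from the previously learned function, which is precisely the role played by $\hat{E}(w)$ above.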