In Proc. of the 28th Annual Conf. of Cog. Science Society, Vancouver, pp. 1275-1280, 2006

Recurrent Networks and Natural Language: Exploiting Self-organization

Igor Farkaš (ifarkas@coli.uni-sb.de)
Matthew W. Crocker (crocker@coli.uni-sb.de)
Department of Computational Linguistics and Phonetics
Saarland University, Saarbrücken, D-66041, Germany

Abstract

Prediction is believed to be an important cognitive component in natural language processing. Within connectionist approaches, Elman’s simple recurrent network has been used for this task with considerable success, especially on small-scale problems. However, it has been appreciated for some time that supervised gradient-based learning models are difficult to scale up, because their learning becomes very time-consuming for larger data sets. In this paper, we explore an alternative neural network architecture that exploits self-organization. The prediction task is effectively split into separate stages of self-organized context representation and subsequent association with the next-word target distribution. We compare various prediction models and show, in the task of learning a language generated by a stochastic context-free grammar, that self-organization can lead to higher accuracy, faster training, greater robustness and more transparent internal representations than Elman’s network.

Introduction

Recurrent neural networks have traditionally been used in various tasks that involve time-dependent data. The best-known architecture is the Simple Recurrent Network (SRN; Elman, 1990), which has been employed in a variety of tasks, including language learning by prediction (e.g. Elman, 1991; Servan-Schreiber et al., 1991; Rohde and Plaut, 1997; Christiansen and Chater, 1999). Supervised learning of temporal dependencies by prediction typically involves error-gradient learning algorithms, of which various forms have been proposed (see Pearlmutter, 1995, for an overview).
Despite their considerable success, these supervised learning approaches are difficult to scale up to realistic tasks due to their learning complexity. One common aspect of these methods is that, via error back-propagation, they optimize the internal states of a recurrent network for a particular task. In the prediction task this implies that both internal states and predictions are optimized using the same learning mechanism.

Here we explore an alternative avenue along which we split the whole task into two subtasks and treat them independently. Hence, we first optimize the internal states, and then we associate these with the desired predictions. Optimizing internal states consists of building temporal context representations, and since this process is not driven by supervision, it can potentially benefit from self-organization. Self-organized temporal context learning is expected not only to facilitate the learning process but has also been argued to have greater biological plausibility.

A number of unsupervised methods have been proposed during the last decade (see overviews in Barreto et al., 2003; Hammer et al., 2004a). Here we focus on two models, namely the Recursive Self-Organizing Map (RecSOM; Voegtlin, 2002) and the feedforward SardNet (James and Miikkulainen, 1995), which represent, in a sense, complementary approaches to representing the temporal context. RecSOM has been shown to demonstrate a rich repertoire of dynamic behavior when trained on a complex symbolic sequence such as natural language text (Tiňo and Farkaš, 2005). Similarly, it has been shown that SardNet, when added as a parallel input-preprocessing module to a supervised recurrent network, enhances the processing capacity of the network in a shift-reduce parsing task (Mayberry and Miikkulainen, 1999). Once the context representations are optimized with a chosen self-organizing module, we associate them with the desired predictions using a supervised learning module. We tested two such modules.
One is a simple counting method that builds independent prediction distributions for all units in the map. The other is a single-layer perceptron trained by the error delta rule.

Simulation methods

Self-organization of temporal context

For temporal context learning, we explored two basic self-organizing modules – RecSOM and SardNet – as well as a combination of the two, which we called RecSOMsard. We describe them all in more detail below.

Recursive Self-Organizing Map

The architecture of the RecSOM model is shown in Figure 1 (without the top layer). Each map neuron i ∈ {1, 2, ..., N} has two weight vectors associated with it: w_i ∈ R^n, linked with the n-dimensional input s(t), and c_i ∈ R^N, linked with the context y(t−1) = (y_1(t−1), y_2(t−1), ..., y_N(t−1)) containing the map activations y_i(t−1) from the previous time step. The output of neuron i at time t is computed as y_i(t) = exp(−d_i(t)), where

d_i(t) = α·‖s(t) − w_i‖² + β·‖y(t−1) − c_i‖²,

with ‖·‖ denoting the Euclidean norm. The parameters α > 0 and β > 0 respectively control the influence of the input and the context on a neuron’s profile. Both weight
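The RecSOM activation step above can be sketched in NumPy. This is a minimal illustrative sketch, not the paper's implementation; the function name, array layouts, and parameter values are our own assumptions.

```python
import numpy as np

def recsom_activation(s, y_prev, W, C, alpha, beta):
    """One RecSOM activation step, following the equations above.

    s      : current input vector s(t), shape (n,)
    y_prev : map activations y(t-1) from the previous step, shape (N,)
    W      : input weight vectors w_i stored as rows, shape (N, n)
    C      : context weight vectors c_i stored as rows, shape (N, N)

    Computes d_i(t) = alpha*||s - w_i||^2 + beta*||y_prev - c_i||^2
    and returns y_i(t) = exp(-d_i(t)) for every neuron i.
    """
    d = alpha * np.sum((W - s) ** 2, axis=1) \
        + beta * np.sum((C - y_prev) ** 2, axis=1)
    return np.exp(-d)
```

At each time step, the returned activation vector is fed back as y_prev for the next step, which is what gives the map its recursive memory of the input sequence.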