Gamma-filter self-organising neural networks for unsupervised sequence processing

P.A. Estévez, R. Hernández, C.A. Perez and C.M. Held

Adding γ-filters to self-organising neural networks for unsupervised sequence processing is proposed. The proposed γ-context model is applied to self-organising maps and neural gas networks. The γ-context model is a generalisation that includes the previously published merge-context model as a particular case. The results show that the γ-context model outperforms the merge-context model in terms of temporal quantisation error and state-space representation.

Introduction: Self-organising feature maps (SOMs) [1] have been used extensively for clustering and data visualisation of static data. Recently, SOMs have been extended to sequence processing by using recurrent connections and temporal or spatial context representations; see [2] for a review. Engineering applications include the analysis of temporally or spatially connected data, such as DNA chains, biomedical signals, speech and image processing, and time series in general. In the merge SOM (MSOM) model [3], each neuron is associated with a weight vector and a context vector. The merge context model can be combined with neural gas (NG) [4], creating the merge neural gas (MNG) model [5]. In this Letter, adding γ-filters [6] to self-organising neural networks for unsupervised sequence processing is proposed. The γ-context model is applied to extend SOM and NG networks.

Method: We describe our method using the NG network model. A reference vector $\mathbf{w}^i \in \mathbb{R}^d$ is associated with the $i$th neuron, for $i = 1, \dots, M$, where $d$ is the dimensionality of the input space and $M$ is the number of neurons. A set of γ-filter context vectors $C = \{\mathbf{c}_1^i, \mathbf{c}_2^i, \dots, \mathbf{c}_K^i\}$ is added to each $i$th neuron, where $\mathbf{c}_k^i \in \mathbb{R}^d$ for $k = 1, \dots, K$; $i = 1, \dots, M$, and $K$ is the highest γ-filter order.
When a data vector $\mathbf{x}(n)$ is presented to the network at time $n$, the best matching unit (BMU), $I_n$, is determined by finding the closest neuron according to the following recursive distance criterion, or distortion error:

$$d_i(n) = \alpha_0 \,\|\mathbf{x}(n) - \mathbf{w}^i\|^2 + \sum_{k=1}^{K} \alpha_k \,\|\mathbf{C}_k(n) - \mathbf{c}_k^i\|^2 \tag{1}$$

where the parameters $\alpha_k$, for $k = 0, \dots, K$, control the contribution of the reference and context vectors. Because the context construction is recursive, setting $\alpha_0 > \alpha_1 > \dots > \alpha_K > 0$ is recommended. The $K$ descriptors of the current context, $\mathbf{C}_k(n)$, are recursively computed as the linear combination of the contexts of the $k$th and $(k-1)$th order γ-filters associated with the BMU obtained at the previous time step, $I_{n-1}$, i.e.

$$\mathbf{C}_k(n) = \beta\, \mathbf{c}_k^{I_{n-1}} + (1 - \beta)\, \mathbf{c}_{k-1}^{I_{n-1}} \quad \forall k = 1, \dots, K \tag{2}$$

where $\mathbf{c}_0^{I_{n-1}} = \mathbf{w}^{I_{n-1}}$, and the initial conditions $\mathbf{c}_k^{I_0}$, $\forall k = 1, \dots, K$, are set randomly. The parameter $\beta \in (0, 1)$ provides a mechanism to decouple depth, $D$, and resolution, $R$, from the filter order. Depth indicates how far into the past the memory stores information. Resolution indicates the degree to which the information relative to the individual elements of the sequence is preserved. The mean memory depth for a γ-filter of order $K$ is $D = K/(1 - \beta)$, and its resolution is $R = 1 - \beta$ [6]. By increasing $K$, we can achieve a greater depth without compromising resolution. It is easy to verify that for $K = 1$ in (2), i.e. when only a single γ-filter stage is used, the merge context model used in MSOM [3] and MNG [5] is recovered.

When an input vector $\mathbf{x}$ is presented, we determine the BMU, the second closest neuron, and so on. The ranking index associated with each $i$th neuron is denoted as $r_i \in \{0, \dots, M - 1\}$. The neighbourhood function is defined in input space as $h_\lambda(r_i) = \exp(-r_i/\lambda)$ $\forall i = 1, \dots, M$, where $\lambda$ defines the neighbourhood size.
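As an illustration of (1), (2) and the rank-based neighbourhood function, the following is a minimal NumPy sketch, not the authors' implementation. The function names and the array layout (`w` of shape `(M, d)`, `c` of shape `(M, K, d)`, `alpha` of length `K + 1`) are assumptions made here for the example.

```python
import numpy as np

def context_descriptors(w, c, prev_bmu, beta):
    """Eq. (2): C_k(n) = beta * c_k + (1 - beta) * c_{k-1} of the previous BMU."""
    # Stage 0 is the reference vector of the previous BMU (c_0 = w).
    stages = np.vstack([w[prev_bmu][None, :], c[prev_bmu]])  # (K+1, d)
    return beta * stages[1:] + (1.0 - beta) * stages[:-1]    # (K, d)

def distortion(x, C, w, c, alpha):
    """Eq. (1): distortion error d_i(n) for every neuron i."""
    d_ref = np.sum((x - w) ** 2, axis=1)              # (M,)  ||x - w_i||^2
    d_ctx = np.sum((C[None, :, :] - c) ** 2, axis=2)  # (M, K) ||C_k - c_k^i||^2
    return alpha[0] * d_ref + d_ctx @ alpha[1:]

def rank_and_neighbourhood(d, lam):
    """Ranking index r_i (0 for the BMU) and h_lambda(r_i) = exp(-r_i / lambda)."""
    ranks = np.argsort(np.argsort(d))
    return ranks, np.exp(-ranks / lam)
```

With `K = 1`, `context_descriptors` returns `beta * c + (1 - beta) * w` for the previous BMU, which is the merge context referred to above.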
The cost function for the γ-NG model is defined as:

$$E_{\gamma\text{-NG}}(\mathbf{w}, \mathbf{c}_1, \dots, \mathbf{c}_K, \lambda) = \frac{1}{2H(\lambda)} \sum_{i=1}^{M} \int_{V} P(\mathbf{x})\, h_\lambda(r_i)\, d_i(\mathbf{x}, \mathbf{w}^i, \mathbf{c}_1^i, \dots, \mathbf{c}_K^i)\, d\mathbf{x} \tag{3}$$

where $V \subseteq \mathbb{R}^d$ is the input space; $P(\mathbf{x})$ is the probability distribution of data vectors over the manifold $V$; $h_\lambda$ is the neighbourhood function; $d_i$ is the distortion error; and $H(\lambda)$ is a normalisation factor, $H(\lambda) = \sum_{i=1}^{M} h_\lambda(r_i)$. When $\alpha_k = 0$, $\forall k = 1, \dots, K$, in (1), the γ-NG cost function (3) reduces to the original NG cost function [4]. Minimising (3) by stochastic gradient descent, we obtain the following update rules for the reference and context vectors:

$$\Delta \mathbf{w}^i = \epsilon_w(n)\, h_\lambda(r_i)\, (\mathbf{x}(n) - \mathbf{w}^i), \qquad \Delta \mathbf{c}_k^i = \epsilon_c(n)\, h_\lambda(r_i)\, (\mathbf{C}_k(n) - \mathbf{c}_k^i) \tag{4}$$

where the learning rates $\epsilon_w$, $\epsilon_c$, as well as $\lambda$, decrease monotonically from an initial value, $v_0$, to a final value, $v_f$, as $v(n) = v_0 (v_f/v_0)^{n/n_{max}}$, where $n_{max}$ is the maximum number of iterations.

Temporal quantisation error (TQE) [7] is used to assess the representation of temporal dependencies in the map. The quantisation error of a neuron corresponds to the standard deviation associated with its receptive field. The TQE of the entire map for $t$ time-steps back into the past is defined as the average quantisation error measured over all neurons.

Recurrence plots are useful tools for visualising recurrences of dynamical systems in the phase space [8]. A recurrence matrix is defined in which the $ij$th component is one if the states of the system at times $i$ and $j$ are similar up to an error $\epsilon$; otherwise the corresponding entry in the matrix is zero. The parameter $\epsilon$ is fixed so that the recurrence rate is 0.02 [8]. The γ-NG model can be interpreted as a quantised reconstruction of the phase space in the $(K + 1)$-dimensional space by defining the vector $\mathbf{r}(t) = [\mathbf{w}^{I_n}, \mathbf{c}_1^{I_n}, \dots, \mathbf{c}_K^{I_n}]$.

γ-NG algorithm:
1. Initialise randomly the reference vectors $\mathbf{w}^i$ and context vectors $\mathbf{c}_k^i$, $i = 1, \dots, M$; $k = 1, \dots, K$.
2. Present input vector, $\mathbf{x}(n)$, to the network.
3. Calculate context descriptors $\mathbf{C}_k(n)$ using (2).
4. Find the BMU, $I_n$, using (1).
5. Update reference and context vectors using (4).
6. Set $n \leftarrow n + 1$.
7. If $n < L$, go back to step 2, where $L$ is the cardinality of the data set.

The γ-context model can easily be added to SOM by replacing the neighbourhood function $h_\lambda(r_i)$ in (4) with a neighbourhood function defined on the 2D output grid, typically a Gaussian centred on the winner unit $i^*$, $h_\sigma(d_G(i, i^*)) = \exp(-d_G(i, i^*)/\sigma(n))$, where $d_G$ is the Manhattan distance and $\sigma$ is the neighbourhood size; see [9] for more details on γ-SOM.

Experiments were carried out with two data sets: the Mackey-Glass time series [3] and Rössler's chaotic system [8]. Parameter $\beta$ in (2) was varied from 0.1 to 0.9 in steps of 0.1. Likewise, the number of filter stages, $K$, was varied from 1 to 9. The number of neurons for both SOM (square grid) and NG variants was set to $M = 0.1L$, where $L$ is the length of the time series. Training was done in two stages lasting 100 epochs each. In the first stage, the parameters were set to $\alpha_0 = 0.5$ and $\alpha_k = 0.5/K$, $\forall k = 1, \dots, K$, in (1), to favour good convergence of the reference vectors. In the second training stage, the parameters were set to $\alpha_k = (K + 1 - k)/A$, $\forall k = 0, \dots, K$, where $A$ is determined by imposing the constraint $\sum_k \alpha_k = 1$. The initial and final values of the parameters $\lambda$, $\epsilon_w$, $\epsilon_c$ were set as follows: $\lambda_0 = 0.1M$, $\lambda_f = 0.01$, $\epsilon_{w0} = 0.3$, $\epsilon_{wf} = 0.001$, $\epsilon_{c0} = 0.3$, $\epsilon_{cf} = 0.001$. $\sigma$ was set in the same way as $\lambda$. $\epsilon_w$ was fixed to 0.001 during the second training stage.

Fig. 1 Temporal quantisation error against time-steps into past for Mackey-Glass time series, using SOM, NG, MSOM and γ-SOM
[Plot: x-axis, time steps into past (0 to 30); y-axis, temporal quantisation error (0 to 0.25); legend: SOM, MSOM (β = 0.6), γ-SOM (K = 6, β = 0.1), NG]

ELECTRONICS LETTERS 14th April 2011 Vol. 47 No. 8
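The γ-NG algorithm steps and the annealing schedule described above can be sketched in NumPy as follows. This is a single-stage illustration under stated assumptions, not the authors' code: the function names (`anneal`, `stage_two_alphas`, `train_gamma_ng`), the uniform random initialisation within the data range, the arbitrary choice of the initial "previous BMU", and cycling through the series over several epochs are all choices made here for the example.

```python
import numpy as np

def anneal(v0, vf, n, n_max):
    """Schedule v(n) = v0 * (vf / v0) ** (n / n_max)."""
    return v0 * (vf / v0) ** (n / n_max)

def stage_two_alphas(K):
    """alpha_k = (K + 1 - k) / A for k = 0..K, with A chosen so they sum to 1."""
    raw = np.arange(K + 1, 0, -1, dtype=float)  # K+1, K, ..., 1
    return raw / raw.sum()                      # A = (K + 1)(K + 2) / 2

def train_gamma_ng(X, M, K, beta, alpha, epochs=1, seed=0,
                   lam0=None, lam_f=0.01, eps_w=(0.3, 0.001), eps_c=(0.3, 0.001)):
    """Single-stage gamma-NG training on a series X of shape (L, d)."""
    rng = np.random.default_rng(seed)
    L, d = X.shape
    # Step 1: random initialisation (assumed uniform over the data range).
    w = rng.uniform(X.min(), X.max(), size=(M, d))
    c = rng.uniform(X.min(), X.max(), size=(M, K, d))
    lam0 = 0.1 * M if lam0 is None else lam0
    n_max = epochs * L
    prev = int(rng.integers(M))  # assumed: arbitrary initial "previous BMU"
    for n in range(n_max):
        x = X[n % L]                                       # step 2
        stages = np.vstack([w[prev][None, :], c[prev]])    # c_0 = w
        C = beta * stages[1:] + (1.0 - beta) * stages[:-1] # step 3, eq. (2)
        dist = (alpha[0] * np.sum((x - w) ** 2, axis=1)
                + np.sum((C - c) ** 2, axis=2) @ alpha[1:])  # eq. (1)
        ranks = np.argsort(np.argsort(dist))               # r_i, 0 for the BMU
        prev = int(np.argmin(dist))                        # step 4: BMU I_n
        h = np.exp(-ranks / anneal(lam0, lam_f, n, n_max))
        # Step 5: update rules (4) with annealed learning rates.
        w += anneal(*eps_w, n, n_max) * h[:, None] * (x - w)
        c += anneal(*eps_c, n, n_max) * h[:, None, None] * (C - c)
    return w, c
```

A call such as `train_gamma_ng(X, M=len(X) // 10, K=6, beta=0.1, alpha=stage_two_alphas(6))` would mirror one of the reported settings for a series `X` of shape `(L, 1)`; the two-stage procedure with different $\alpha_k$ per stage is not reproduced in this sketch.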