Mixture of Experts Applied to Nonlinear Dynamic Systems Identification: A Comparative Study

Clodoaldo Ap. M. Lima, André L. V. Coelho, Fernando J. Von Zuben
Department of Computer Engineering and Industrial Automation (DCA)
School of Electrical and Computer Engineering (FEEC)
State University of Campinas - Unicamp
{moraes,coelho,vonzuben}@dca.fee.unicamp.br

Abstract

A mixture of experts (ME) model provides a modular approach wherein component neural networks are made specialists on subparts of a problem. In this framework, which follows the "divide-and-conquer" philosophy, a gating network learns how to softly partition the input space into regions, each to be properly modeled by one or more expert networks. In this paper, we investigate the application of different ME variants to some multivariate nonlinear dynamic systems identification problems which are known to be difficult to handle. The aim is to provide a comparative performance analysis of variable settings of the standard, gated, and localized ME models against more conventional NN models.

1. Introduction

Mixture of Experts (ME) models [2][3] consist of a family of modular neural network (NN) approaches which follow the "divide-and-conquer" strategy of distilling complex problems into simple subtasks. In a statistical sense, an ME should be regarded as a mixture model for estimating conditional probability distributions. In this framework, a gating network (GN) is in charge of learning how to softly divide the input space into (possibly overlapping) regions, each assigned to one or more expert networks (ENs). In terms of a mixture model [8][9], EN and GN outputs correspond, respectively, to conditional component densities and to input-dependent mixture coefficients. Such an interpretation enables an ME to be trained with Expectation-Maximization (EM) algorithms, and ME models have produced good results when applied to classification [4]-[6] and forecasting [7][11] problems.
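As a minimal illustration of this mixture-model view, the E-step of an EM training round computes the posterior responsibility of each expert for an observed target. The sketch below assumes scalar targets and Gaussian component densities with a fixed variance; the function and parameter names are illustrative, not taken from the paper.

```python
import numpy as np

def responsibilities(g, expert_outputs, y, sigma2=1.0):
    """E-step responsibilities for an ME (illustrative sketch).

    g: gating outputs g_j(x), shape (m,), nonnegative, summing to 1.
    expert_outputs: expert predictions y_j(x), shape (m,).
    y: observed scalar target.
    sigma2: assumed fixed variance of the Gaussian component
            densities p_j(y|x) = N(y; y_j(x), sigma2).
    Returns h_j, the posterior probability that expert j generated y.
    """
    p = np.exp(-0.5 * (y - expert_outputs) ** 2 / sigma2) \
        / np.sqrt(2.0 * np.pi * sigma2)
    h = g * p                  # prior (gate) times component likelihood
    return h / h.sum()         # normalize to a posterior over experts
```

In the M-step, each expert is then refit with its data points weighted by these responsibilities, which is what drives the specialization of the experts on subregions of the input space.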
In this work, we assess the capabilities of some ME variants for the identification of multivariate nonlinear dynamic systems (IMNDS, for short). In contrast with the identification of linear plants, which is typically done by designing (via the superposition principle) a transfer function modeling the plant behavior, IMNDS is more laborious, since the input-output relation may depend nonlinearly on past state values and the plant may be sensitive to the current operating points. The extraction of a global NN model from data for tackling IMNDS problems is also very difficult [1][13], as the network generalization in some noisy regimes may be hampered (over- or under-fitting) by the dynamic switching between regimes [7]. By contrast, employing distinct experts in different regions proves more fruitful, since each expert can focus on the subset of data inputs that is most relevant to its specific working area.

Besides the standard ME model introduced by Jacobs et al. [2][3], which adopts a single-layer perceptron with soft-max activation function as GN, the ME variants we apply in the IMNDS experiments are: (i) the gated experts (GE), devised by Weigend et al. [7]; and (ii) the localized ME (LME), formulated by Xu et al. and others [5][10][12]. A GE (or society of experts) structure employs nonlinear experts with a nonlinear gating network (both set as multilayer perceptrons, MLPs) for a nonlinear decomposition of the input space. Conversely, LME models use normalized Gaussian kernels to divide the input space by means of soft hyper-ellipsoids (i.e., localized regions assigned to the experts).

The organization of this paper is as follows. Section 2 brings more considerations on the ME model and its variants. In Section 3, we show how to configure NNs (MEs) for IMNDS problems, and in Section 4, we compare the performance of ME variants on four IMNDS experiments. Section 5 brings the final discussion.

2.
Mixture of Experts and its Variants

In the standard ME framework (Fig. 1), we have a set of expert networks j = 1, ..., m, all of which look at the input vector x to produce mapping outputs y_j(x). There is only one gating network, also looking at the input, but that, instead, produces outputs g_j(x) >= 0, with sum_j g_j(x) = 1, which weight the expert network outputs to form the overall ME output

y(x) = sum_{j=1}^{m} g_j(x) y_j(x).   (1)

Each output g_j(x) should be viewed as the probability of assigning input x to expert j. To be compliant with such an interpretation, the activation functions for the gating outputs (known as soft-max functions) are given by:

g_j(x) = exp(z_j(x)) / sum_{i=1}^{m} exp(z_i(x)),   (2)

where the z_i's are the gating outputs prior to the soft-max (normalized exponential) transformation. This choice makes the experts more competitive and enforces the constraints that the gating outputs be positive and sum to unity.

Proceedings of the VII Brazilian Symposium on Neural Networks (SBRN'02) 0-7695-1709-9/02 $17.00 © 2002 IEEE