Mixture of Experts Applied to Nonlinear Dynamic Systems Identification:
A Comparative Study
Clodoaldo Ap. M. Lima, André L. V. Coelho, Fernando J. Von Zuben
Department of Computer Engineering and Industrial Automation (DCA)
School of Electrical and Computer Engineering (FEEC)
State University of Campinas - Unicamp
{moraes,coelho,vonzuben}@dca.fee.unicamp.br
Abstract
A mixture of experts (ME) model provides a modular
approach wherein component neural networks are made
specialists on subparts of a problem. In this framework,
which follows the "divide-and-conquer" philosophy, a
gating network learns how to softly partition the input
space into regions, each to be properly modeled by one or
more expert networks. In this paper, we investigate the
application of different ME variants to some multivariate
nonlinear dynamic systems identification problems that
are known to be hard to deal with. The aim is to provide a
comparative performance analysis of several settings of
the standard, gated, and localized ME models against
more conventional NN models.
1. Introduction
Mixture of Experts (ME) models [2][3] consist of a family
of modular neural network (NN) approaches which follow
the "divide-and-conquer" strategy of distilling complex
problems into simple subtasks. In a statistical sense, an
ME should be regarded as a mixture model for estimating
conditional probability distributions. In this framework, a
gating network (GN) is in charge of learning how to softly
divide the input space into (possibly overlapping) regions
to be each assigned to one or more expert networks (ENs).
In terms of a mixture model [8][9], EN and GN outputs
correspond, respectively, to conditional component
densities and to input-biased mixture coefficients. Such
interpretation enables an ME to be trained with
Expectation-Maximization (EM) algorithms, and ME
models have produced good results when applied to
classification [4]-[6] and forecasting [7][11] problems.
In this work, we assess the capabilities of some ME
variants for the identification of multivariate nonlinear
dynamic systems (IMNDS, for short). In contrast to the
identification of linear plants, which is typically
accomplished by designing (via the superposition
principle) a transfer function that models the plant
behavior, IMNDS is more laborious, since the input-output
relation may depend nonlinearly on past state values and
the plant may be sensitive to the current operating point.
Extracting a single global NN model from data to tackle
IMNDS problems is also very difficult [1][13], as the
network's generalization in some noisy regimes may be
hampered (over- or under-fitting) by the dynamic
switching between regimes [7]. Employing distinct
experts in different regions, by contrast, turns out to be
more fruitful, since each expert can focus on the subset of
data inputs that is most relevant to its specific working
region.
Besides the standard ME model introduced by Jacobs
et al. [2][3], which has a single-layer perceptron with soft-
max activation function as GN, the ME variants we apply
for IMNDS experiments are: (i) the gated experts (GE),
devised by Weigend et al. [7]; and (ii) the localized ME
(LME), formulated by Xu et al. and others ([5][10][12]).
A GE (or society of experts) structure employs non-linear
experts together with a non-linear gating network (both
set as multilayer perceptrons, MLPs) for a non-linear
decomposition of the input space. Conversely, LME
models use normalized
Gaussian kernels for the division of the input space by
means of soft hyper-ellipsoids (i.e., localized regions
assigned to the experts).
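The localized gating just described can be sketched in a few lines: each expert owns a soft hyper-ellipsoid defined by a Gaussian kernel, and the gate is the expert's prior-weighted density normalized over all experts. The function names and parameters (means, covariances, priors) below are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """Multivariate Gaussian density N(x; mean, cov)."""
    d = x.size
    diff = x - mean
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(expo) / norm

def localized_gates(x, means, covs, priors):
    """LME gating: g_j(x) = a_j N(x; m_j, S_j) / sum_i a_i N(x; m_i, S_i)."""
    dens = np.array([gaussian_density(x, m, S)
                     for m, S in zip(means, covs)])
    weighted = priors * dens
    return weighted / weighted.sum()

# Toy usage: two experts centered at different regions of the input space
means = np.array([[0.0, 0.0], [2.0, 2.0]])
covs = [np.eye(2), np.eye(2)]
priors = np.array([0.5, 0.5])
g = localized_gates(np.array([0.1, 0.1]), means, covs, priors)
```

An input near a kernel center receives most of its gating mass from that expert, which is what localizes each expert's region of responsibility.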
The organization of this paper is as follows. Section 2
brings more considerations on the ME model and its
variants. In Section 3, we show how to configure NNs
(MEs) for IMNDS problems, and in Section 4, we
compare the performance of ME variants on four IMNDS
experiments. Section 5 brings final discussion.
2. Mixture of Experts and its Variants
In the standard ME framework (Fig. 1), we have a set of
expert networks j = 1, ..., m, all of which look at the input
vector x to form mapping outputs y_j. There is only one
gating network, also looking at the input, which instead
produces outputs g_j ≥ 0, with Σ_j g_j = 1, that weight the
expert network outputs to form the overall ME output

    y(x) = Σ_{j=1}^{m} g_j(x) y_j(x).    (1)
Each output g_j should be viewed as the probability of
assigning input x to expert j. To comply with this
interpretation, the activation functions for the gating
outputs (known as soft-max functions) are given by:

    g_j(x) = exp(z_j(x)) / Σ_{i=1}^{m} exp(z_i(x)),    (2)
where the z_i's are the gating outputs before the soft-max
normalization. These normalized exponentials make the
experts more competitive and enforce the constraints that
the gating outputs be positive and sum to unity.
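As an illustration, Eqs. (1) and (2) can be combined into a minimal NumPy forward pass. Linear experts and a linear gate are an assumption made purely to keep the sketch short; in practice the experts and gate may be single- or multi-layer networks.

```python
import numpy as np

def softmax(z):
    """Soft-max of Eq. (2): outputs are positive and sum to one."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def me_output(x, expert_weights, gate_weights):
    """Overall ME output of Eq. (1): y(x) = sum_j g_j(x) y_j(x).

    expert_weights: list of m weight matrices (one per expert);
    gate_weights: (m, d_in) matrix producing the pre-soft-max z_j(x).
    Both are hypothetical parameterizations for illustration.
    """
    y_experts = np.array([W @ x for W in expert_weights])  # y_j(x)
    g = softmax(gate_weights @ x)                          # g_j(x)
    return g @ y_experts                                   # weighted sum

# Toy usage: m = 3 experts, 2-dim input, 1-dim output
rng = np.random.default_rng(0)
experts = [rng.standard_normal((1, 2)) for _ in range(3)]
gate = rng.standard_normal((3, 2))
y = me_output(np.array([0.5, -1.0]), experts, gate)
```

Because the gate is a soft-max, the combination in `me_output` is a convex mixture of the expert outputs, matching the probabilistic reading of g_j as the probability of assigning x to expert j.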
Proceedings of the VII Brazilian Symposium on Neural Networks (SBRN’02)
0-7695-1709-9/02 $17.00 © 2002 IEEE