Modular Deep Belief Networks that do not Forget

Leo Pape, Faustino Gomez, Mark Ring and Jürgen Schmidhuber

Abstract— Deep belief networks (DBNs) are popular for learning compact representations of high-dimensional data. However, most approaches so far rely on having a single, complete training set. If the distribution of relevant features changes during subsequent training stages, the features learned in earlier stages are gradually forgotten. Often it is desirable for learning algorithms to retain what they have previously learned, even if the input distribution temporarily changes. This paper introduces the M-DBN, an unsupervised modular DBN that addresses the forgetting problem. M-DBNs are composed of a number of modules that are trained only on samples they best reconstruct. While modularization by itself does not prevent forgetting, the M-DBN additionally uses a learning method that adjusts each module's learning rate proportionally to the fraction of best-reconstructed samples. On the MNIST handwritten digit dataset, module specialization largely corresponds to the digits discerned by humans. Furthermore, in several learning tasks with changing MNIST digits, M-DBNs retain learned features even after those features are removed from the training data, while monolithic DBNs of comparable size forget feature mappings learned before.

I. INTRODUCTION

DEEP BELIEF NETWORKS (DBNs; [1]) are popular for learning compact representations of high-dimensional data. DBNs are neural networks consisting of a stack of Boltzmann machine layers that are trained one at a time, in an unsupervised fashion, to induce increasingly abstract representations of the inputs in subsequent layers.
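The greedy layer-wise procedure can be illustrated with a minimal sketch: binary restricted Boltzmann machines trained with one-step contrastive divergence (CD-1), each layer fed the hidden activations of the previous one. This is a simplified stand-in, not the authors' implementation; all class and function names here are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Binary restricted Boltzmann machine trained with CD-1."""
    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_update(self, v0):
        """One contrastive-divergence step on a batch of visible vectors."""
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)       # one-step reconstruction
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return np.mean((v0 - v1) ** 2)           # reconstruction error

def train_dbn(data, layer_sizes, epochs=5):
    """Greedy layer-wise training: each RBM is trained on the hidden
    activations of the previously trained layer."""
    rbms, x = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(x.shape[1], n_hidden)
        for _ in range(epochs):
            rbm.cd1_update(x)
        x = rbm.hidden_probs(x)   # becomes the input to the next layer
        rbms.append(rbm)
    return rbms
```

In practice the layers are trained on mini-batches with momentum and weight decay; the sketch omits these refinements for brevity.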
This layer-by-layer training procedure facilitates supervised training of deep networks, which are in principle more efficient at learning compact representations of high-dimensional data than shallow architectures [2], but are also notoriously difficult to train with traditional gradient methods (e.g., [3]). DBNs can be particularly useful as sensory preprocessors for learning agents that interact with an environment that requires learning complex action mappings from high-dimensional inputs. Often these input spaces are embedded within a vast state space where the input distribution may vary widely between regions. Rather than assemble a single, monolithic training set covering all eventualities, it is more efficient to train an agent incrementally, such that it can build upon what it learned previously. This continual learning paradigm [4] demands that the underlying learning algorithms support the retention of earlier training.

All authors are at IDSIA, University of Lugano, SUPSI, Lugano, Switzerland. Email: {pape, tino, mark, juergen}@idsia.ch. This work was supported by the EU under contract numbers FP7-ICT-IP-231722 and FP7-NMP-228844.

While DBNs (and the related approach of stacked autoencoders) have been successfully applied to many tasks [1, 2, 5–7], most approaches rely on training data that are sampled from a stationary distribution. However, in continual learning, where the statistics of the training data change over time, DBNs, like most connectionist approaches, gradually forget previously learned representations as new input patterns overwrite those patterns that become less probable.

A possible remedy to forgetting is to split a monolithic network into a number of expert modules, each of which specializes on a subset of the task.
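The modular training idea can be sketched as follows: each sample is routed to the module that reconstructs it best, and each module's update is scaled by the fraction of the batch it wins, with modules below a small threshold left untouched. To keep the sketch self-contained, tiny linear autoencoders stand in for DBN modules; the class, parameter names, and default values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearAE:
    """Tiny tied-weight linear autoencoder standing in for a DBN module
    (a simplification; the paper's modules are RBM stacks)."""
    def __init__(self, n_in, n_code):
        self.W = 0.1 * rng.standard_normal((n_in, n_code))

    def recon_error(self, X):
        R = X @ self.W @ self.W.T          # encode, then decode
        return np.mean((X - R) ** 2, axis=1)

    def update(self, X, lr):
        # one gradient step on the squared reconstruction error
        E = X @ self.W @ self.W.T - X
        grad = (X.T @ E @ self.W + E.T @ X @ self.W) / len(X)
        self.W -= lr * grad

def train_batch(modules, X, base_lr=0.01, threshold=0.05):
    """Winner-take-all routing by reconstruction error, with each module's
    learning rate scaled by the fraction of samples it wins."""
    errs = np.stack([m.recon_error(X) for m in modules])   # (k, n)
    winners = np.argmin(errs, axis=0)
    for i, m in enumerate(modules):
        frac = np.mean(winners == i)
        if frac < threshold:
            continue                # skip: protects unused expertise
        m.update(X[winners == i], base_lr * frac)
    return winners
```

The threshold check is what distinguishes this scheme from plain winner-take-all training: a module whose subtask has vanished from the data receives almost no winning samples, so its weights stop changing rather than drifting toward the new distribution.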
In such an ensemble approach, expert modules are trained to improve their performance only on those subtasks they are already good at, ignore the subtasks of other experts, and thereby protect their own weights from corruption by unrelated input patterns. Jacobs et al. [8] introduced a supervised method for training local experts in which a gating network is trained to assign each input pattern to the expert that produced the lowest output error. An unsupervised version of this algorithm, described in [9], uses the reconstruction error of each module on a sample to train the modules with the best reconstruction to become even better, and, optionally, to train the other modules to become worse at reconstructing that sample. While these methods divide a task over multiple modules, they contain no mechanism for preventing an expert module from shifting its expertise when the statistics of the training data change over time: even a single deviating sample can significantly alter a module's expertise once the subtask in which the module specialized disappears from the training data.

This paper presents the modular DBN (M-DBN), an unsupervised method for training expert DBN modules that avoids catastrophic forgetting when the dataset changes. Similar to [9], only the module that best reconstructs a sample gets trained. In addition, M-DBNs use a batch-wise learning scheme in which each module is updated in proportion to the fraction of samples it best reconstructs. If that fraction is less than a small threshold value, the module is not trained at all. The experimental results demonstrate that these modifications to the original DBN are sufficient to facilitate module specializa-