Applied Intelligence
https://doi.org/10.1007/s10489-020-02032-4
Learning flat representations with artificial neural networks
Vlad Constantinescu 1,2 · Costin Chiru 1,2 · Tudor Boloni 3 · Adina Florea 2 · Robi Tacutu 1,4
Accepted: 21 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
In this paper, we propose a method of learning representation layers with squashing activation functions within a deep artificial neural network, directly addressing the vanishing gradients problem. The proposed solution is derived from solving the maximum likelihood estimator for components of the posterior representation, which are approximately Beta-distributed, formulated in the context of variational inference. This approach not only improves the performance of deep neural networks with squashing activation functions on some of the hidden layers, including in discriminative learning, but can also be employed to produce sparse codes.
Keywords Learning representations · Infomax · Beta distribution · Vanishing gradients
1 Introduction
Currently, most deep neural network models tend to avoid squashing activation functions (AFs), such as the logistic sigmoid or the hyperbolic tangent [1]. This is because Stochastic Gradient Descent (SGD) optimization, implemented as back-propagation, has been found to be ineffective due to the vanishing derivatives when the function reaches its saturation region [2]. For feed-forward neural networks, the typical approach relies instead on various non-squashing AFs, which are less prone to vanishing gradients for activations in (0, ∞), such as the Rectified Linear Unit (ReLU) [3, 4], the softplus activation [4], or Exponential Linear Units [5].
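As a numerical illustration (not taken from the paper), the derivative of the logistic sigmoid collapses in its saturation region, and back-propagation multiplies one such factor per layer, which is the mechanism behind vanishing gradients:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, a squashing activation mapping R -> (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (at x = 0) and collapses in saturation.
g_center = sigmoid_grad(0.0)      # 0.25
g_saturated = sigmoid_grad(10.0)  # ~4.5e-5

# Back-propagation multiplies one such factor per layer, so even a few
# saturated sigmoid layers shrink the gradient towards zero.
depth = 5
chained = sigmoid_grad(10.0) ** depth  # ~1.9e-22
```

The same collapse occurs for the hyperbolic tangent, whose derivative 1 − tanh²(x) also vanishes for large |x|.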
Vlad Constantinescu
vlad.ion.constantinescu@gmail.com

Robi Tacutu
robi.tacutu@gmail.com

1 Systems Biology of Aging Group, Institute of Biochemistry of the Romanian Academy, Bucharest, Romania
2 Computer Science and Engineering Department, University Politehnica of Bucharest, Bucharest, Romania
3 AITIAOne Inc., 2531 Piedmont Ave., Montrose, CA, United States
4 Chronos Biosystems SRL, Bucharest, Romania

However, these approaches come with a main disadvantage: gradients may explode in the positive region of the AFs' inputs, and thus additional measures have to be taken to regularize them. Moreover, although ReLU is the most popular AF, it has an additional problem: its gradients become 0 in the negative region, so some weights stop being adjusted. This means that parts of the network stop responding to error variations at some point during training. By contrast, if a technical solution to the vanishing gradients problem existed, squashing functions could offer significant advantages, mainly due to their nonlinear nature, with a smooth analog activation and a bounded output range.
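The "dying ReLU" effect mentioned above can be checked numerically. In this toy example (illustrative only, not from the paper), a unit whose pre-activations are all negative receives exactly zero gradient for its weights, so it can never recover during training:

```python
import numpy as np

def relu_grad(pre_activation):
    """ReLU derivative: 1 for positive inputs, 0 otherwise."""
    return (pre_activation > 0).astype(float)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))   # a batch of 64 inputs with 8 features
w = rng.normal(size=8)         # weights of a single hidden unit
b = -20.0                      # a large negative bias pushes the unit into the dead region

pre = x @ w + b                # pre-activations: all negative for this batch
upstream = rng.normal(size=64) # error signal arriving from later layers

# The gradient of the loss w.r.t. w is x^T (upstream * relu'(pre)):
# once relu'(pre) is 0 everywhere, the weight update is exactly 0.
grad_w = x.T @ (upstream * relu_grad(pre))
```

A squashing AF in the same situation would still pass a small (if vanishing) gradient; ReLU passes none at all.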
Considering the above, in this work we revisit the technical problem of vanishing gradients and present a method able to alleviate this main disadvantage of squashing functions. The method proposes an alternative formulation of the canonically described vanishing gradients problem, allowing the manipulation of the distribution p(z) of the latent variables z as a product of Beta distributions. The introduction of a surrogate regularization loss serves to address vanishing gradients as a problem of maximizing the entropy H(z) of the intermediary z layer, using approximate inference to solve it. Since the method involves controlling the distribution of latent variables in the context of maximizing entropy, we propose to validate it mainly for unsupervised learning of representations with deep autoencoders. Additionally, however, we also show that the method may be used for solving discriminative inference problems or recurrent speech recognition, as long as the dimensionality of the z layer on which the surrogate regularization loss is applied is much smaller than that of the input variable (d ≪ n).
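As a rough sketch of the general idea (the loss form, the parameter names a and b, and the Beta(2, 2) target below are illustrative assumptions, not the paper's exact formulation), a surrogate regularizer can score squashed activations z ∈ (0, 1) by their negative log-likelihood under a chosen Beta distribution: codes saturated near 0 or 1 are then heavily penalized, which pushes units back into the region where gradients flow.

```python
import numpy as np
from math import lgamma

def beta_nll(z, a, b, eps=1e-7):
    """Mean negative log-likelihood of activations z in (0, 1) under Beta(a, b)."""
    z = np.clip(z, eps, 1.0 - eps)
    log_norm = lgamma(a) + lgamma(b) - lgamma(a + b)  # log of the Beta function B(a, b)
    log_pdf = (a - 1.0) * np.log(z) + (b - 1.0) * np.log(1.0 - z) - log_norm
    return float(-np.mean(log_pdf))

# Under a Beta(2, 2) target, saturated sigmoid codes incur a large penalty,
# while codes spread across (0, 1) do not.
saturated = np.full(100, 0.999)
spread = np.linspace(0.1, 0.9, 100)
loss_sat = beta_nll(saturated, 2.0, 2.0)
loss_spread = beta_nll(spread, 2.0, 2.0)
```

In practice such a term would be added, with some weight, to the task loss on the bottleneck z layer; this is only a sketch of distribution-shaping regularization, not the paper's derived estimator.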