Applied Intelligence
https://doi.org/10.1007/s10489-020-02032-4

Learning flat representations with artificial neural networks

Vlad Constantinescu 1,2 · Costin Chiru 1,2 · Tudor Boloni 3 · Adina Florea 2 · Robi Tacutu 1,4

Accepted: 21 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
In this paper, we propose a method of learning representation layers with squashing activation functions within a deep artificial neural network, directly addressing the vanishing gradients problem. The proposed solution is derived by solving the maximum likelihood estimator for components of the posterior representation, which are approximately Beta-distributed, formulated in the context of variational inference. This approach not only improves the performance of deep neural networks with squashing activation functions on some of the hidden layers, including in discriminative learning, but can also be employed to produce sparse codes.

Keywords Learning representations · Infomax · Beta distribution · Vanishing gradients

1 Introduction

Currently, most deep neural network models tend to avoid squashing activation functions (AFs), such as the logistic sigmoid or the hyperbolic tangent [1]. This is because Stochastic Gradient Descent (SGD) optimization, implemented as back-propagation, has been found to be ineffective due to the vanishing derivatives when the function reaches its saturation region [2]. For feed-forward neural networks, the typical approach relies on various non-squashing AFs, which are less prone to vanishing gradients for activations in (0, ∞), such as the Rectified Linear Unit (ReLU) [3, 4], the softplus activation [4] or the Exponential Linear Unit [5].
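The contrast between the two families of AFs can be illustrated with a minimal sketch (ours, not from the paper): the sigmoid's derivative peaks at 0.25 and collapses toward zero in its saturation region, while the ReLU's derivative stays constant over (0, ∞) but is exactly zero for negative inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximal (0.25) at x = 0, vanishes in saturation

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # constant over (0, inf), zero otherwise

print(sigmoid_grad(0.0))   # 0.25: near-linear region
print(sigmoid_grad(10.0))  # ~4.5e-5: saturated, gradient vanishes
print(relu_grad(10.0))     # 1.0: no shrinkage for positive inputs
print(relu_grad(-1.0))     # 0.0: unit is "dead" for negative inputs
```

Back-propagation multiplies such factors across layers, so a chain of saturated sigmoid units shrinks the gradient geometrically, which is the failure mode the paper targets.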
Vlad Constantinescu
vlad.ion.constantinescu@gmail.com

Robi Tacutu
robi.tacutu@gmail.com

1 Systems Biology of Aging Group, Institute of Biochemistry of the Romanian Academy, Bucharest, Romania
2 Computer Science and Engineering Department, University Politehnica of Bucharest, Bucharest, Romania
3 AITIAOne Inc., 2531 Piedmont Ave., Montrose, CA, United States
4 Chronos Biosystems SRL, Bucharest, Romania

However, these approaches come with a main disadvantage: the gradients may explode in the positive region of the AFs' inputs, and thus additional measures have to be taken to regularize them. Moreover, although ReLU is the most popular AF, it has one additional problem: its gradients become 0 in the negative region, so some weights stop being adjusted. This means that parts of the network stop responding to error variations at some point during training. By contrast, if a technical solution to the vanishing gradients problem existed, squashing functions could present significant advantages, mainly due to their nonlinear nature, with a smooth analog activation and a bounded output range.

Considering the above, in this work we revisit the technical problem of vanishing gradients and present a method able to alleviate this main disadvantage of squashing functions. The method proposes an alternative formulation of the canonically-described vanishing gradients problem, allowing the manipulation of the distribution p(z) of the latent variables z as a product of Beta distributions. The introduction of a surrogate regularization loss recasts vanishing gradients as the problem of maximizing the entropy H(z) of the intermediary layer z, which is then solved with approximate inference. Since the method involves controlling the distribution of latent variables in the context of maximizing entropy, we propose to validate it mainly on unsupervised representation learning with deep autoencoders.
Additionally, we show that the method may be used for solving discriminative inference problems or recurrent speech recognition, as long as the dimensionality of the z layer on which the surrogate regularization loss is applied is much smaller than that of the input variable (d ≪ n).