Published on Mon Sep 24 2018

Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function

Wojciech Tarnowski, Piotr Warchoł, Stanisław Jastrzębski, Jacek Tabor, Maciej A. Nowak

Abstract

We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do so by deriving, with the help of free probability and random matrix theory, a universal formula for the spectral density of the input-output Jacobian at initialization, in the limit of large network width and depth. The resulting singular value spectrum depends on a single parameter, which we calculate for a variety of popular activation functions by analyzing signal propagation in the artificial neural network. We corroborate our results with numerical simulations of both random matrices and ResNets applied to the CIFAR-10 classification problem. Moreover, we study the consequences of this universal behavior for the initial and late phases of learning. We conclude by drawing attention to the simple fact that initialization acts as a confounding factor between the choice of activation function and the rate of learning, and we propose that, based on our results, this can be resolved in ResNets by ensuring the same level of dynamical isometry at initialization.
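A minimal numerical sketch of the setup described above (not the authors' code): we draw a random residual stack x_{l+1} = x_l + phi(W_l x_l), accumulate the input-output Jacobian J = prod_l (I + D_l W_l) with D_l = diag(phi'(W_l x_l)), and inspect its singular values. The width, depth and weight scale sigma_w below are illustrative choices; the paper's claim is that the shape of this spectrum is universal, with the activation entering only through a single scalar parameter.

import numpy as np

def jacobian_singular_values(phi, dphi, width=200, depth=50, sigma_w=0.1, seed=0):
    # One random draw of the input-output Jacobian of a residual stack at initialization.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    J = np.eye(width)
    for _ in range(depth):
        W = sigma_w * rng.standard_normal((width, width)) / np.sqrt(width)
        h = W @ x                                        # pre-activations of the residual branch
        J = (np.eye(width) + np.diag(dphi(h)) @ W) @ J   # chain rule: dx_{l+1}/dx_l = I + D_l W_l
        x = x + phi(h)                                   # residual update
    return np.linalg.svd(J, compute_uv=False)

for name, phi, dphi in [
    ("tanh", np.tanh, lambda h: 1.0 - np.tanh(h) ** 2),
    ("relu", lambda h: np.maximum(h, 0.0), lambda h: (h > 0).astype(float)),
]:
    s = jacobian_singular_values(phi, dphi)
    print(f"{name}: singular values in [{s.min():.2f}, {s.max():.2f}], mean {s.mean():.2f}")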

Fri May 03 2019
Machine Learning
Static Activation Function Normalization
Static activation normalization provides a first step toward benefits similar in spirit to those of schemes like batch normalization, but without the computational cost. It significantly improves convergence robustness, maximum training depth, and anytime performance.
Fri Feb 12 2021
Machine Learning
Applicability of Random Matrix Theory in Deep Learning
We investigate the local spectral statistics of the loss surface Hessians of artificial neural networks. We find excellent agreement with Gaussian Orthogonal Ensemble statistics across several network architectures and datasets. We propose a novel model for the true loss surfaces of neural networks, consistent with our observations.
Tue Jul 31 2018
Machine Learning
Spectrum concentration in deep residual learning: a free probability approach
We revisit the initialization of deep residual networks (ResNets) by introducing to the deep learning community a novel analytical tool from free probability. This tool deals with non-Hermitian random matrices, rather than their conventional Hermitian counterparts in the literature.
Mon Nov 13 2017
Machine Learning
Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
Dynamical isometry is the condition that all singular values of the input-output Jacobian concentrate near 1. For deep linear networks, this condition can be achieved through orthogonal weight initialization. Sigmoidal networks, too, can achieve this property, but only with orthogonal weight initialization.
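Not from that paper, but a quick sketch of the statement above: in a deep linear network the input-output Jacobian is simply the product of the weight matrices, so with orthogonal weights every singular value equals 1 exactly, while scaled Gaussian weights let the spectrum spread with depth. Width and depth here are arbitrary illustrative values.

import numpy as np

width, depth = 100, 30
rng = np.random.default_rng(1)

J_orth, J_gauss = np.eye(width), np.eye(width)
for _ in range(depth):
    Q, _ = np.linalg.qr(rng.standard_normal((width, width)))  # a random orthogonal matrix
    J_orth = Q @ J_orth
    J_gauss = (rng.standard_normal((width, width)) / np.sqrt(width)) @ J_gauss

for name, J in [("orthogonal", J_orth), ("gaussian", J_gauss)]:
    s = np.linalg.svd(J, compute_uv=False)
    print(f"{name:10s}: min singular value {s.min():.2e}, max {s.max():.2e}")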
Sun Jun 17 2018
Machine Learning
Initialization of ReLUs for Dynamical Isometry
Deep learning relies on good initialization schemes and hyperparameter choices prior to training a neural network. Random weight initializations give rise to random network ensembles. The results obtained so far rely on mean field approximations that assume infinite layer width.
Fri Jan 08 2021
Computer Vision
Residual networks classify inputs based on their neural transient dynamics
We analyze the input-output behavior of residual networks from a dynamical system point of view. We show that there is a cooperation and competition dynamics between residuals corresponding to each input dimension. We also develop a new method to adjust the depth for residual networks during training.