Published on Sun Dec 24 2017

Spurious Local Minima are Common in Two-Layer ReLU Neural Networks

Itay Safran, Ohad Shamir

We consider the optimization problem associated with training simple ReLU neural networks. We provide a computer-assisted proof that even if the input distribution is standard Gaussian, the problem can still have spurious local minima.

Abstract

We consider the optimization problem associated with training simple ReLU neural networks of the form $\mathbf{x}\mapsto \sum_{i=1}^{k}\max\{0,\mathbf{w}_i^\top \mathbf{x}\}$ with respect to the squared loss. We provide a computer-assisted proof that even if the input distribution is standard Gaussian, even if the dimension is arbitrarily large, and even if the target values are generated by such a network, with orthonormal parameter vectors, the problem can still have spurious local minima once $6\le k\le 20$. By a concentration of measure argument, this implies that in high input dimensions, \emph{nearly all} target networks of the relevant sizes lead to spurious local minima. Moreover, we conduct experiments which show that the probability of hitting such local minima is quite high, and increasing with the network size. On the positive side, mild over-parameterization appears to drastically reduce such local minima, indicating that an over-parameterization assumption is necessary to get a positive result in this setting.
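Below is a minimal sketch (not the authors' code) of the setting described in the abstract: a "student" network of the form $\mathbf{x}\mapsto \sum_{i=1}^{k}\max\{0,\mathbf{w}_i^\top \mathbf{x}\}$ is trained with the squared loss to match a "teacher" of the same form with orthonormal parameter vectors, over standard Gaussian inputs. All hyperparameters (dimension, width, step size, sample size) and the Monte Carlo approximation of the population loss are illustrative assumptions, not taken from the paper.

```python
# Sketch of the teacher-student setup from the abstract, under assumed hyperparameters.
import torch

torch.manual_seed(0)

d, k = 20, 6            # input dimension and number of hidden units (abstract: 6 <= k <= 20)
n_samples = 100_000     # fixed Gaussian sample used to approximate the population loss
lr, steps = 1e-2, 5_000

# Teacher with orthonormal parameter vectors: first k rows of a random orthogonal matrix.
Q, _ = torch.linalg.qr(torch.randn(d, d))
V = Q[:k]                               # (k, d), orthonormal rows

# Student initialized at random.
W = torch.randn(k, d, requires_grad=True)

x = torch.randn(n_samples, d)           # standard Gaussian inputs
y = torch.relu(x @ V.T).sum(dim=1)      # target values generated by the teacher network

opt = torch.optim.SGD([W], lr=lr)
for _ in range(steps):
    opt.zero_grad()
    pred = torch.relu(x @ W.T).sum(dim=1)
    loss = 0.5 * ((pred - y) ** 2).mean()   # squared loss
    loss.backward()
    opt.step()

# If the final loss stays bounded away from zero while the gradient is near zero,
# the run has plausibly converged to a spurious local minimum rather than to the
# global minimum W = V (up to permutation of the rows).
print(f"final loss: {loss.item():.6f}, grad norm: {W.grad.norm().item():.6f}")
```

Repeating such runs from many random initializations, and increasing the student width beyond k, is one way to probe the abstract's two empirical claims: that the probability of hitting a spurious minimum grows with the network size, and that mild over-parameterization drastically reduces it.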

Sat Feb 10 2018
Machine Learning
Small nonlinearities in activation functions create bad local minima in neural networks
We investigate the loss surface of neural networks. We prove that for almost all practical datasets there exist infinitely many bad local minima. We also present a counterexample showing that a bad local minimum exists for more general activations as well.
Mon Nov 04 2019
Machine Learning
Sub-Optimal Local Minima Exist for Neural Networks with Almost All Non-Linear Activations
Mon Jun 15 2020
Machine Learning
Understanding Global Loss Landscape of One-hidden-layer ReLU Networks, Part 2: Experiments and Analysis
The existence of local minima for one-hidden-layer ReLU networks has been investigated theoretically in [8]. We first analyze how large the probability of encountering local minima is for 1D Gaussian data, and show that this probability is very low in most regions.
Wed Nov 21 2018
Artificial Intelligence
Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks
We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. We show that, for a broad family of loss functions and with proper random weight initialization, both gradient descent and stochastic gradient descent can find global minima of the training loss.
Tue Nov 20 2018
Neural Networks
Effect of Depth and Width on Local Minima in Deep Learning
In this paper, we analyze the effects of depth and width on the quality of local minima, without the strong over-parameterization and simplification assumptions made in the literature. We empirically support our theoretical observation with a synthetic dataset as well as MNIST and CIFAR-10.
Mon Mar 23 2020
Neural Networks
Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep Network Losses
It has been argued that neural network losses enjoy a no-bad-local-minima property and an abundance of saddle points. However, the methods used to find these putative critical points suffer from a bad local minima problem of their own.