Published on Fri Jun 11 2021

Label Noise SGD Provably Prefers Flat Global Minimizers

Alex Damian, Tengyu Ma, Jason Lee

In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory. We study the implicit regularization effect of SGD with label noise. We also prove extensions to classification with general loss functions.

Abstract

In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by empirical studies that demonstrate that training with noisy labels improves generalization, we study the implicit regularization effect of SGD with label noise. We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L$ is the training loss, $\lambda$ is an effective regularization parameter depending on the step size, the strength of the label noise, and the batch size, and $R(\theta)$ is an explicit regularizer that penalizes sharp minimizers. Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones. We also prove extensions to classification with general loss functions, SGD with momentum, and SGD with general noise covariance, significantly strengthening the prior work of Blanc et al. to global convergence and large learning rates and of HaoChen et al. to general models.
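As a concrete illustration of the algorithm the abstract describes, the sketch below runs minibatch SGD on a toy overparametrized regression problem while perturbing the minibatch labels with fresh Gaussian noise at every step. This is a minimal sketch, not the authors' code: the names (`sigma`, `lr`, `batch_size`) and the toy data are assumptions chosen for the example. Per the abstract, the resulting trajectory behaves like SGD on $L(\theta) + \lambda R(\theta)$, with $\lambda$ set by the step size, the noise strength, and the batch size.

```python
# Minimal sketch of SGD with label noise on a toy overparametrized
# regression problem (illustrative only; not the paper's implementation).
import numpy as np

rng = np.random.default_rng(0)

# Toy data with more parameters than samples.
n, d = 20, 100
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) * 0.1

def loss_grad(theta, Xb, yb):
    """Gradient of the mean squared error on a minibatch."""
    residual = Xb @ theta - yb
    return Xb.T @ residual / len(yb)

theta = rng.normal(size=d)
lr, sigma, batch_size, steps = 0.01, 0.5, 4, 20000  # illustrative hyperparameters

for _ in range(steps):
    idx = rng.choice(n, size=batch_size, replace=False)
    noisy_labels = y[idx] + sigma * rng.normal(size=batch_size)  # fresh label noise each step
    theta -= lr * loss_grad(theta, X[idx], noisy_labels)

print("train MSE:", np.mean((X @ theta - y) ** 2))
```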

Tue Oct 17 2017
Artificial Intelligence
A Bayesian Perspective on Generalization and Stochastic Gradient Descent
We consider two questions at the heart of machine learning. How can we predict if a minimum will generalize to the test set? And why does stochastic gradient descent find minima that generalize well?
Mon Jun 15 2020
Machine Learning
Shape Matters: Understanding the Implicit Bias of the Noise Covariance
The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. The paper theoretically characterizes this phenomenon on a quadratically-parameterized model.
Wed Mar 27 2019
Machine Learning
Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks
Modern neural networks are typically trained in an over-parameterized regime where the parameters of the model far exceed the size of the training data. Such neural networks in principle have the capacity to (over)fit any set of labels including pure noise.
Mon Jun 25 2018
Artificial Intelligence
Stochastic natural gradient descent draws posterior samples in function space
We prove that for sufficiently small learning rates, the stationary distribution of minibatch NGD approaches a Bayesian posterior near local minima. The temperature is controlled by the learning rate and training set size.
Thu Jan 18 2018
Machine Learning
When Does Stochastic Gradient Algorithm Work Well?
In this paper, we consider a general stochastic optimization problem which is often at the core of supervised learning, such as deep learning and linear classification. We consider a standard stochastic gradient descent (SGD) method with a fixed, large step size and propose a novel assumption
Fri Oct 27 2017
Machine Learning
Stochastic Conjugate Gradient Algorithm with Variance Reduction
Conjugate gradient (CG) methods are a class of important methods for solving linear equations and nonlinear optimization problems. In this paper, we propose a new stochastic CG algorithm with variance reduction.
Fri Feb 26 2021
Machine Learning
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
Full-batch gradient descent typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value 2/(step size). We hope that our findings will inspire future efforts aimed at more rigorously understanding optimization.
Sat Oct 03 2020
Machine Learning
Sharpness-Aware Minimization for Efficiently Improving Generalization
Sharpness-Aware Minimization (SAM) seeks parameters that lie in neighborhoods having uniformly low loss. This results in a min-max optimization problem on which gradient descent can be performed efficiently.
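Since the summary above describes SAM's min-max structure only in words, here is a minimal sketch of a single SAM update under the usual first-order approximation: an ascent step of radius `rho` along the normalized gradient, followed by a descent step using the gradient at the perturbed point. The function names and the quadratic toy loss are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of one Sharpness-Aware Minimization (SAM) update
# (illustrative; not the authors' implementation).
import numpy as np

def sam_step(theta, grad_fn, lr=0.1, rho=0.05):
    """One SAM step: ascend to a nearby worst-case point, then descend.

    grad_fn(theta) should return the (minibatch) gradient of the loss.
    """
    g = grad_fn(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # inner ascent step of radius rho
    g_sharp = grad_fn(theta + eps)                # gradient at the perturbed point
    return theta - lr * g_sharp                   # outer descent step

# Example on the quadratic loss 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = sam_step(theta, lambda t: t)
print(theta)  # driven toward the minimizer at the origin
```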
Fri Apr 19 2019
Machine Learning
Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process
We consider networks trained via stochastic gradient descent to minimize the loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error.
Thu May 25 2017
Machine Learning
Implicit Regularization in Matrix Factorization
We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix $X$ with gradient descent on a factorization of $X$. We provide empirical and theoretical evidence that, with small enough step sizes and initialization close enough to the origin, gradient descent converges to the minimum nuclear norm solution.
Fri Mar 06 2015
Machine Learning
Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition
In many cases, for non-convex functions, the goal is to find a reasonable local minimum. The main concern is that gradient updates are trapped in saddle points. In this paper we identify a strict saddle property that allows for efficient optimization.