Published on Sun Nov 01 2020

Two-Level K-FAC Preconditioning for Deep Learning

Nikolaos Tselepidis, Jonas Kohler, Antonio Orvieto

Many optimization methods use gradient covariance information in order to accelerate the convergence of Stochastic Gradient Descent. In this work, we extend K-FAC by enriching it with global curvature information. We achieve this by adding a coarse-space correction term to the preconditioner.

Abstract

In the context of deep learning, many optimization methods use gradient covariance information in order to accelerate the convergence of Stochastic Gradient Descent. In particular, starting with Adagrad, a seemingly endless line of research advocates the use of diagonal approximations of the so-called empirical Fisher matrix in stochastic gradient-based algorithms, with the most prominent one arguably being Adam. However, in recent years, several works cast doubt on the theoretical basis of preconditioning with the empirical Fisher matrix, and it has been shown that more sophisticated approximations of the actual Fisher matrix more closely resemble the theoretically well-motivated Natural Gradient Descent. One particularly successful variant of such methods is the so-called K-FAC optimizer, which uses a Kronecker-factored block-diagonal Fisher approximation as preconditioner. In this work, drawing inspiration from two-level domain decomposition methods used as preconditioners in the field of scientific computing, we extend K-FAC by enriching it with off-diagonal (i.e. global) curvature information in a computationally efficient way. We achieve this by adding a coarse-space correction term to the preconditioner, which captures the global Fisher information matrix at a coarser scale. We present a small set of experimental results suggesting improved convergence behaviour of our proposed method.
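
To make the idea concrete, the following is a minimal NumPy sketch of an additive two-level preconditioner in the domain-decomposition spirit described above. It is an illustration only, not the authors' implementation: the function and variable names, the layer-wise restriction operator R, the damping term, and the additive combination are assumptions, and the per-layer K-FAC inverse blocks are taken as given.

import numpy as np

def two_level_precondition(grad, kfac_inv_blocks, R, F_coarse, damping=1e-3):
    # grad            : flattened gradient over all parameters, shape (n,)
    # kfac_inv_blocks : list of (index_array, inverse_block) pairs -- the usual
    #                   block-diagonal (layer-wise) K-FAC inverse (fine level)
    # R               : (c, n) restriction matrix onto a small coarse space,
    #                   e.g. one aggregated direction per layer (assumption)
    # F_coarse        : (c, c) Fisher matrix projected onto the coarse space,
    #                   R F R^T, carrying global (off-diagonal) curvature
    fine = np.zeros_like(grad)
    for idx, inv_block in kfac_inv_blocks:
        fine[idx] = inv_block @ grad[idx]       # fine-level K-FAC solve per layer

    rhs = R @ grad                               # restrict gradient to coarse space
    coarse = np.linalg.solve(F_coarse + damping * np.eye(R.shape[0]), rhs)
    correction = R.T @ coarse                    # prolong back to the full space

    return fine + correction                     # additive two-level combination

The additive combination above is just one standard two-level variant from the domain-decomposition literature; multiplicative and deflation-type corrections are equally common, and the paper's coarse space may be constructed differently.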

Mon Mar 26 2018
Machine Learning
Online Second Order Methods for Non-Convex Stochastic Optimizations
This paper proposes a family of online second order methods for possibly non-convex stochastic optimizations. It is based on the theory of preconditioned stochastic gradient descent (PSGD), which can be regarded as an enhanced stochastic Newton method.
Mon Dec 14 2015
Machine Learning
Preconditioned Stochastic Gradient Descent
Stochastic gradient descent (SGD) is still the workhorse for many practical problems. However, it converges slowly and can be difficult to tune. It is now possible to precondition SGD to accelerate its convergence remarkably.
Fri Jun 08 2018
Machine Learning
Efficient Full-Matrix Adaptive Regularization
Adaptive regularization methods pre-multiply a descent direction by a preconditioning matrix. We show how to modify full-matrix adaptive regularization in order to make it practical and effective.
Fri Oct 18 2019
Machine Learning
First-Order Preconditioning via Hypergradient Descent
First-order preconditioning (FOP) is a fast, scalable approach that generalizes previous work on hypergradient descent. FOP is able to improve the performance of standard deep learning optimizers on visual classification and reinforcement learning tasks with minimal computational overhead.
Mon Feb 26 2018
Machine Learning
Shampoo: Preconditioned Stochastic Tensor Optimization
Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization.
Fri Nov 27 2020
Machine Learning
Eigenvalue-corrected Natural Gradient Based on a New Approximation
Using second-order optimization methods for training deep neural networks has attracted many researchers. A recently proposed method, Eigenvalue-corrected Kronecker Factorization (EKFAC) (George et al., 2018), interprets the natural gradient update as a diagonal method.