Published on Mon Feb 26 2018

Shampoo: Preconditioned Stochastic Tensor Optimization

Vineet Gupta, Tomer Koren, Yoram Singer

Abstract

Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Although it involves a more complex update rule, Shampoo's runtime per step is comparable to that of simple gradient methods such as SGD, AdaGrad, and Adam.
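
To make the per-dimension preconditioning concrete, here is a minimal sketch of the matrix (order-2) case of the update described in the abstract, written in NumPy. The helper names are mine; the fourth-root exponent is the paper's choice for matrices (order-k tensors use exponent 1/(2k)). A left statistic accumulates G Gᵀ, a right statistic accumulates Gᵀ G, and the gradient is preconditioned on both sides.

```python
import numpy as np

def inv_root(M, p, eps=1e-8):
    """M^(-1/p) for a symmetric positive semidefinite matrix, via eigh."""
    w, V = np.linalg.eigh(M)
    w = np.maximum(w, eps)                  # guard against zero eigenvalues
    return (V * w ** (-1.0 / p)) @ V.T

def shampoo_step(W, G, L, R, lr=0.1):
    """One Shampoo step for a matrix parameter W with stochastic gradient G."""
    L += G @ G.T                            # row statistics (contract over columns)
    R += G.T @ G                            # column statistics (contract over rows)
    W -= lr * inv_root(L, 4) @ G @ inv_root(R, 4)
    return W, L, R

# illustrative usage on a small random parameter
m, n, eps = 4, 3, 1e-4
W = np.random.randn(m, n)
L, R = eps * np.eye(m), eps * np.eye(n)     # L_0 = eps*I_m, R_0 = eps*I_n
for _ in range(5):
    G = np.random.randn(m, n)               # stand-in for a minibatch gradient
    W, L, R = shampoo_step(W, G, L, R)
```

Because L and R are only m×m and n×n, rather than a single mn×mn full-matrix preconditioner, the per-step memory and compute stay tractable for large layers, which is the point of the runtime comparison in the abstract.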

Mon Mar 26 2018
Machine Learning
Online Second Order Methods for Non-Convex Stochastic Optimizations
This paper proposes a family of online second order methods for possibly non-convex stochastic optimizations. It is based on the theory of preconditioned stochastic gradient descent (PSGD), which can be regarded as an enhanced stochastic Newton method.
Tue May 04 2021
Artificial Intelligence
Implicit Regularization in Deep Tensor Factorization
Thu Nov 15 2018
Machine Learning
Stable Tensor Neural Networks for Rapid Deep Learning
We propose a tensor neural network (t-NN) framework that offers an exciting new paradigm for designing neural networks with multidimensional (tensor) data. Our network architecture is based on the t-product, an algebraic formulation to multiply tensors via circulant convolution.
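
As a point of reference for the t-product mentioned in this snippet: one standard way to compute it is in the Fourier domain, since circulant convolution along the third dimension diagonalizes under the FFT. The NumPy sketch below assumes that construction (the function name is mine):

```python
import numpy as np

def t_product(A, B):
    """t-product of A (n1 x n2 x n3) with B (n2 x n4 x n3): circulant
    convolution along the third axis, computed slice-wise in Fourier space."""
    Ah = np.fft.fft(A, axis=2)                   # FFT over the tubes
    Bh = np.fft.fft(B, axis=2)
    Ch = np.einsum('ikt,kjt->ijt', Ah, Bh)       # matrix product per frontal slice
    return np.real(np.fft.ifft(Ch, axis=2))

A = np.random.randn(4, 3, 5)
B = np.random.randn(3, 2, 5)
C = t_product(A, B)                              # shape (4, 2, 5)
```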
Fri Jun 08 2018
Machine Learning
Efficient Full-Matrix Adaptive Regularization
Adaptive regularization methods pre-multiply a descent direction by a pre-conditioning matrix. We show how to modify full-matrix adaptive regularization in order to make it practical and effective.
Fri Feb 19 2021
Machine Learning
Implicit Regularization in Tensor Factorization
We provide the first theoretical analysis of implicit regularization in tensor factorization -- tensor completion via a certain type of non-linear neural network. We circumvent the notorious difficulty of tensor problems by adopting a dynamical systems perspective.
Wed Aug 30 2017
Machine Learning
Tensor Networks for Dimensionality Reduction and Large-Scale Optimizations. Part 2 Applications and Future Perspectives
Part 2 of this monograph builds on the introduction to tensor networks and their operations presented in Part 1. It focuses on tensor network models for super-compressed higher-order representation of data/parameters and related cost functions.
Mon Jun 12 2017
NLP
Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Mon Dec 22 2014
Machine Learning
Adam: A Method for Stochastic Optimization
Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and has little memory requirements. It is well suited for problems that are large in terms of data and parameters.
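
For reference, the update this abstract summarizes fits in a few lines; below is a minimal NumPy rendering of Adam's bias-corrected moment estimates with the paper's default hyperparameters (variable names are mine):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: exponential moving averages of the gradient (m) and its
    elementwise square (v), bias-corrected, then a rescaled parameter update."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # correct the bias from zero initialization
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(3)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 6):
    g = np.random.randn(3)             # stand-in for a stochastic gradient
    w, m, v = adam_step(w, g, m, v, t)
```

The only per-parameter state is the pair (m, v), which is what the abstract means by small memory requirements.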
Thu Mar 19 2015
Neural Networks
Optimizing Neural Networks with Kronecker-factored Approximate Curvature
Kronecker-Factored Approximate Curvature (K-FAC) is based on an efficiently invertible approximation of a neural network's Fisher information matrix. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices.
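
The practical payoff of approximating a layer's Fisher block as a Kronecker product of two small factors is that the block never has to be inverted explicitly; the solve decomposes into two small ones. The NumPy check below illustrates that identity under a column-stacking vec convention (the sizes, names, and factor ordering are illustrative, not the paper's notation):

```python
import numpy as np

# (A kron G)^{-1} vec(V) == vec(G^{-1} V A^{-1}) for symmetric positive-definite
# A (input-side factor) and G (output-side factor), with column-stacking vec.
rng = np.random.default_rng(0)
n_in, n_out = 3, 4
A = np.cov(rng.standard_normal((n_in, 50)))     # small input-side factor
G = np.cov(rng.standard_normal((n_out, 50)))    # small output-side factor
V = rng.standard_normal((n_out, n_in))          # layer gradient, n_out x n_in

direct = np.linalg.solve(np.kron(A, G), V.flatten(order='F'))
factored = (np.linalg.solve(G, V) @ np.linalg.inv(A)).flatten(order='F')
assert np.allclose(direct, factored)            # same result, tiny inverses only
```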
Thu Nov 10 2016
Machine Learning
Understanding deep learning requires rethinking generalization
Conventional wisdom attributes small generalization error to properties of the model family or to the regularization techniques used during training. We show how these traditional approaches fail to explain why large neural networks generalize well in practice.
Mon Jun 08 2015
Neural Networks
Path-SGD: Path-Normalized Optimization in Deep Neural Networks
We revisit the choice of SGD for training deep neural networks. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network. We suggest Path-SGD, which is an approximate steepest descent method.
Mon Jun 08 2015
Machine Learning
Faster SGD Using Sketched Conditioning
We propose a novel method for speeding up stochastic optimization algorithms. We discuss the applicability of our method to deep neural networks, and experimentally demonstrate its merits.
Fri Jul 03 2020
Machine Learning
Descending through a Crowded Valley -- Benchmarking Deep Learning Optimizers
We analyze over 50,000 runs with different optimizers and find that optimizer performance varies considerably across tasks; Adam remains a strong baseline.
Thu Jun 10 2021
Machine Learning
Does Knowledge Distillation Really Work?
Knowledge distillation is a popular technique for training a small student network to emulate a larger teacher model. We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
Fri Jun 11 2021
Machine Learning
LocoProp: Enhancing BackProp via Local Loss Optimization
We study a local loss construction approach for optimizing neural networks. We improve the local loss construction by forming a Bregman divergence in each layer tailored to the local problem. We show that our construction consistently improves convergence.
Wed Jun 02 2021
Machine Learning
A Generalizable Approach to Learning Optimizers
Our system outperforms Adam on all neural network tasks, including modalities not seen during training. We achieve a 2x speedup on ImageNet and a 2.5x speedup on a language modeling task, using over 1.5 orders of magnitude more compute than the training tasks.
Wed Jan 30 2019
Machine Learning
Memory-Efficient Adaptive Optimization
Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for achieving state-of-the-art performance in machine translation and language modeling. These methods maintain second-order statistics for each parameter, thus introducing significant memory overheads. We describe an effective and flexible adaptive optimization method with greatly reduced memory overhead.
Fri Jun 07 2019
Machine Learning
Fast and Simple Natural-Gradient Variational Inference with Mixture of Exponential-family Approximations
Natural-gradient methods enable fast and simple algorithms for Bayesian inference. Their use is mostly limited to minimal exponential-family (EF) approximations. In this paper, we extend their application to estimate structured approximations such as mixtures of EF distributions.