Published on Tue Jun 19 2018

Faster SGD training by minibatch persistency

Matteo Fischetti, Iacopo Mandatelli, Domenico Salvagnin

Large minibatches are of great practical interest as they allow for a better exploitation of modern GPUs. The approach is intended to speed up SGD convergence and also has the advantage of reducing the overhead of loading data into the internal GPU memory.

Abstract

It is well known that, for most datasets, the use of large-size minibatches for Stochastic Gradient Descent (SGD) typically leads to slow convergence and poor generalization. On the other hand, large minibatches are of great practical interest as they allow for a better exploitation of modern GPUs. Previous literature on the subject concentrated on how to adjust the main SGD parameters (in particular, the learning rate) when using large minibatches. In this work we introduce an additional feature, that we call minibatch persistency, that consists in reusing the same minibatch for K consecutive SGD iterations. The computational conjecture here is that a large minibatch contains a significant sample of the training set, so one can afford to slightly overfit it without worsening generalization too much. The approach is intended to speed up SGD convergence, and also has the advantage of reducing the overhead related to data loading on the internal GPU memory. We present computational results on CIFAR-10 with an AlexNet architecture, showing that even small persistency values (K=2 or 5) already lead to significantly faster convergence and to comparable (or even better) generalization than the standard "disposable minibatch" approach (K=1), in particular when large minibatches are used. The lesson learned is that minibatch persistency can be a simple yet effective way to deal with large minibatches.
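
The idea is simple enough to sketch in a few lines. Below is a minimal, hypothetical PyTorch-style training loop illustrating minibatch persistency as described in the abstract: each minibatch fetched from the loader is kept on the GPU and reused for K consecutive SGD updates before the next one is loaded. The function name, hyperparameters, and loader are placeholders, not the authors' actual code.

```python
import torch

def train_with_persistency(model, loader, loss_fn, epochs, K=2, lr=0.01):
    """Plain SGD where each minibatch is reused for K consecutive updates.
    K=1 recovers the standard 'disposable minibatch' scheme."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            # Load the minibatch onto the GPU once...
            x, y = x.cuda(), y.cuda()
            # ...and take K SGD steps on it before fetching the next batch.
            for _ in range(K):
                opt.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                opt.step()
```

With large minibatches, the inner loop amortizes the data-transfer cost over K updates, which is where the claimed overhead reduction comes from.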

Fri Apr 20 2018
Machine Learning
Revisiting Small Batch Training for Deep Neural Networks
Modern deep neural network training is typically based on mini-batch stochastic gradient optimization. While the use of large mini-batches increases computational parallelism, small batch training has been shown to provide improved generalization performance.
Wed May 24 2017
Machine Learning
Train longer, generalize better: closing the generalization gap in large batch training of neural networks
Deep learning models are typically trained using stochastic gradient descent or one of its variants. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance. The "generalization gap" stems from the relatively small number of updates rather than the batch size.
Thu Feb 13 2020
Machine Learning
Scalable and Practical Natural Gradient for Large-Scale Deep Learning
Large-scale distributed training of deep neural networks results in models with worse generalization performance. Previous approaches attempt to address this problem by varying the learning rate and batch size. We propose Scalable and Practical Natural Gradient Descent (SP-NGD), a principled approach for training models.
Wed Aug 23 2017
Machine Learning
Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
"Super-convergence" is a phenomenon where neural networks can be trained an order of magnitude faster than with standard training methods. The existence of super-Convergence is relevant to understanding why deep networks generalize well.
Wed Dec 06 2017
Machine Learning
AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks
Training deep neural networks with Stochastic Gradient Descent requires careful choice of both learning rate and batch size. Smaller batch sizes generally converge in fewer training epochs, while larger batch sizes offer more parallelism and hence better computational efficiency.
Sat Oct 03 2020
Machine Learning
Learning the Step-size Policy for the Limited-Memory Broyden-Fletcher-Goldfarb-Shanno Algorithm
We consider the problem of how to learn a step-size policy for the Limited-Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm. We propose a neural network architecture with local information of the current iterate as the input. The step-length policy is learned from data of …