Published on Fri Nov 30 2018

On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent

Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W. Mahoney, Joseph Gonzalez

Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but the point at which large-batch training strategies break down depends more on attributes like model architecture and data complexity than it does on the size of the dataset.

Abstract

Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but there are a variety of theoretical and systems challenges that impede the widespread success of this technique. We investigate these issues, with an emphasis on time to convergence and total computational cost, through an extensive empirical analysis of network training across several architectures and problem domains, including image classification, image segmentation, and language modeling. Although it is common practice to increase the batch size in order to fully exploit available computational resources, we find a substantially more nuanced picture. Our main finding is that across a wide range of network architectures and problem domains, increasing the batch size beyond a certain point yields no decrease in wall-clock time to convergence for either train or test loss. This batch size is usually substantially below the capacity of current systems. We show that popular training strategies for large batch size optimization begin to fail before we can populate all available compute resources, and we show that the point at which these methods break down depends more on attributes like model architecture and data complexity than it does directly on the size of the dataset.
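To make the setting concrete, here is a minimal, self-contained sketch (in NumPy, not the paper's experimental code) of mini-batch SGD on a synthetic least-squares problem. The batch size is a free parameter, and the learning rate is scaled linearly with it, one popular large-batch training strategy; the function names and hyperparameters (train_sgd, base_lr, base_batch) are illustrative assumptions rather than anything defined in the paper. Larger batches simply mean fewer gradient updates per epoch, which is the trade-off the abstract discusses.

# Minimal sketch (assumed setup, not the paper's code): mini-batch SGD on a
# synthetic least-squares problem. The learning rate follows the linear
# scaling heuristic (lr proportional to batch size) used by popular
# large-batch training strategies; all hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = X @ w_true + noise.
n_samples, n_features = 4096, 32
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + 0.1 * rng.normal(size=n_samples)


def train_sgd(batch_size, base_lr=0.01, base_batch=32, epochs=5):
    """Mini-batch SGD with the learning rate scaled linearly in the batch size."""
    lr = base_lr * (batch_size / base_batch)  # linear scaling heuristic
    w = np.zeros(n_features)
    num_updates = 0
    for _ in range(epochs):
        perm = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)  # mini-batch gradient estimate
            w -= lr * grad
            num_updates += 1
    final_loss = 0.5 * np.mean((X @ w - y) ** 2)
    return final_loss, num_updates


# Doubling the batch size halves the number of updates per epoch; whether the
# extra per-update computation pays off in wall-clock time is the question the
# paper studies empirically on much larger workloads.
for bs in (32, 128, 512, 2048):
    loss, updates = train_sgd(batch_size=bs)
    print(f"batch_size={bs:5d}  updates={updates:4d}  final_loss={loss:.4f}")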

Fri Oct 18 2019
Machine Learning
Improving the convergence of SGD through adaptive batch sizes
Mini-batch stochastic gradient descent (SGD) and variants thereof approximate the objective function's gradient with a small number of training examples. Small batch sizes require little computation for each model update but can yield high-variance gradient estimates. Large batches require more computation and can yield higher …
Thu Sep 15 2016
Machine Learning
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. It has been observed that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize.
Fri Apr 20 2018
Machine Learning
Revisiting Small Batch Training for Deep Neural Networks
Modern deep neural network training is typically based on mini-batch stochastic gradient optimization. While the use of large mini-batches increases computational parallelism, small batch training has been shown to provide improved generalization performance.
Thu Feb 13 2020
Machine Learning
Scalable and Practical Natural Gradient for Large-Scale Deep Learning
Large-scale distributed training of deep neural networks results in models with worse generalization performance. Previous approaches attempt to address this problem by varying the learning rate and batch size. We propose Scalable and Practical Natural Gradient Descent (SP-NGD), a principled approach for training models.
Wed May 24 2017
Machine Learning
Train longer, generalize better: closing the generalization gap in large batch training of neural networks
Deep learning models are typically trained using stochastic gradient descent or one of its variants. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance. The "generalization gap" stems from the relatively small number of updates rather than the batch size.
Mon Jun 15 2020
Machine Learning
The Limit of the Batch Size
Large-batch training is an efficient approach for current distributed deep learning systems. It has enabled researchers to reduce ImageNet/ResNet-50 training from 29 hours to around 1 minute. We think it may provide guidance to AI supercomputer and algorithm designers.