Published on Thu Sep 12 2019

diffGrad: An Optimization Method for Convolutional Neural Networks

Shiv Ram Dubey, Soumendu Chakraborty, Swalpa Kumar Roy, Snehasis Mukherjee, Satish Kumar Singh, Bidyut Baran Chaudhuri

Stochastic Gradient Descent (SGD) is one of the core techniques behind the success of deep neural networks. The main problem with basic SGD is that it changes all parameters by equal-sized steps, regardless of gradient behavior. In the proposed diffGrad optimization technique, the step size is adjusted for each parameter so that parameters with faster-changing gradients take larger steps.

Abstract

Stochastic Gradient Descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic SGD is that it changes all parameters by equal-sized steps, irrespective of gradient behavior. Hence, an efficient way of deep network optimization is to use adaptive step sizes for each parameter. Recently, several attempts have been made to improve gradient descent methods, such as AdaGrad, AdaDelta, RMSProp, and Adam. These methods rely on the square roots of exponential moving averages of squared past gradients. Thus, these methods do not take advantage of local changes in gradients. In this paper, a novel optimizer is proposed based on the difference between the present and the immediate past gradient (i.e., diffGrad). In the proposed diffGrad optimization technique, the step size is adjusted for each parameter in such a way that parameters with faster-changing gradients have a larger step size and parameters with slower-changing gradients have a smaller step size. The convergence analysis is done using the regret bound approach of the online learning framework. Rigorous analysis is made in this paper over three synthetic complex non-convex functions. Image categorization experiments are also conducted over the CIFAR10 and CIFAR100 datasets to observe the performance of diffGrad with respect to state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam. A residual unit (ResNet) based Convolutional Neural Network (CNN) architecture is used in the experiments. The experiments show that diffGrad outperforms the other optimizers. We also show that diffGrad performs uniformly well for training CNNs with different activation functions. The source code is made publicly available at https://github.com/shivram1987/diffGrad.
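For a concrete picture of the update described in the abstract, the following is a minimal NumPy sketch: an Adam-style step whose size is scaled by a sigmoid of the absolute difference between the current and the immediately preceding gradient, so parameters whose gradients change quickly take larger steps. The function name and default hyperparameters are assumptions for illustration only; the authors' PyTorch implementation at https://github.com/shivram1987/diffGrad is the reference.

```python
# Minimal sketch of a diffGrad-style update (assumed form, for illustration).
import numpy as np

def diffgrad_step(theta, grad, prev_grad, m, v, t,
                  lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update for a parameter vector `theta`.

    theta     : current parameters
    grad      : gradient at step t
    prev_grad : gradient at step t-1
    m, v      : moving averages of the gradient and squared gradient
    t         : step counter (1-indexed), used for bias correction
    """
    # Adam-style first and second moment estimates with bias correction.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Friction coefficient: a sigmoid of the absolute gradient change, so a
    # rapidly changing gradient (large |prev_grad - grad|) yields a value
    # near 1 (larger step), while a nearly constant gradient yields ~0.5.
    xi = 1.0 / (1.0 + np.exp(-np.abs(prev_grad - grad)))

    theta = theta - lr * xi * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```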

Tue Sep 07 2021
Machine Learning
Tom: Leveraging trend of the observed gradients for faster convergence
Tom is a novel variant of Adam that takes into account the trend observed in the gradients along the loss landscape traversed by the neural network. Tom outperforms Adagrad, Adadelta, RMSProp, and Adam in terms of both accuracy and speed of convergence.
Fri Jan 01 2021
Machine Learning
Adam revisited: a weighted past gradients perspective
Fri May 21 2021
Machine Learning
AngularGrad: A New Optimization Technique for Angular Convergence of Convolutional Neural Networks
AngularGrad is the first attempt to exploit the gradient angular information in addition to its magnitude. It generates a score to control the step size based on the gradient angular information of previous iterations. Theoretically, it exhibits the same regret bound as Adam for convergence purposes.
Wed Aug 25 2021
Machine Learning
Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization
Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. Compared with stochastic gradient descent, Adam can converge to a different solution with a significantly worse test error in many deep learning applications.
Mon Nov 19 2018
Machine Learning
Deep Frank-Wolfe For Neural Network Optimization
The current practice in neural network optimization is to rely on the stochastic gradient descent (SGD) algorithm. We present an optimization method that offers empirically the best of both worlds. Our algorithm yields good generalization performance while requiring only one hyper-parameter.
Sun Oct 27 2019
Machine Learning
An Adaptive and Momental Bound Method for Stochastic Learning
Training deep neural networks requires intricate initialization and careful selection of learning rates. Adam can produce extremely large learning rates that inhibit the start of learning. We propose the Adaptive and Momental Bound (AdaMod) method.