Published on Fri Oct 11 2019

Theoretical Limits of Pipeline Parallel Optimization and Application to Distributed Deep Learning

Igor Colin, Ludovic Dos Santos, Kevin Scaman

We investigate the theoretical limits of pipeline parallel learning of deep learning architectures. We provide matching lower and upper complexity bounds for smooth convex and non-convex functions. We show that a naive pipeline parallelization of Nesterov's accelerated gradient descent is optimal.

Abstract

We investigate the theoretical limits of pipeline parallel learning of deep learning architectures, a distributed setup in which the computation is distributed per layer instead of per example. For smooth convex and non-convex objective functions, we provide matching lower and upper complexity bounds and show that a naive pipeline parallelization of Nesterov's accelerated gradient descent is optimal. For non-smooth convex functions, we provide a novel algorithm coined Pipeline Parallel Random Smoothing (PPRS) that is within a d^{1/4} multiplicative factor of the optimal convergence rate, where d is the underlying dimension. While the method still converges at the slow rate typical of non-smooth optimization, the depth-dependent part is accelerated, resulting in a near-linear speed-up and a convergence time that only slightly depends on the depth of the deep learning architecture. Finally, we perform an empirical analysis of the non-smooth non-convex case and show that, for difficult and highly non-smooth problems, PPRS outperforms more traditional optimization algorithms such as gradient descent and Nesterov's accelerated gradient descent when the sample size is limited, as in few-shot or adversarial learning.
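The abstract combines two well-known ingredients: randomized smoothing of a non-smooth objective and Nesterov's accelerated gradient descent. The sketch below is a hypothetical illustration of that combination on a toy non-smooth convex problem; it is not the paper's reference implementation, and the objective, step size, smoothing parameter, and sample counts are illustrative assumptions. In a pipeline parallel setting, each perturbed gradient evaluation would additionally be computed layer-by-layer across workers, as the comment inside smoothed_gradient indicates.

```python
# Hypothetical sketch (not the authors' reference implementation) of the
# randomized-smoothing idea behind PPRS-style methods: the non-smooth objective
# f is replaced by its Gaussian smoothing f_gamma(x) = E[f(x + gamma * u)],
# whose gradient is estimated by averaging subgradients at randomly perturbed
# points; the smoothed surrogate is then minimized with Nesterov's accelerated
# gradient descent.
import numpy as np

rng = np.random.default_rng(0)

# Toy non-smooth convex objective: f(x) = ||A x - b||_1.
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)

def f(x):
    return np.abs(A @ x - b).sum()

def subgradient(x):
    # A valid subgradient of ||A x - b||_1.
    return A.T @ np.sign(A @ x - b)

def smoothed_gradient(x, gamma=1e-2, num_samples=20):
    """Monte-Carlo estimate of the gradient of the Gaussian-smoothed objective."""
    grads = []
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape)
        # In a pipeline-parallel deployment, each perturbed subgradient
        # evaluation would be computed layer-by-layer across workers.
        grads.append(subgradient(x + gamma * u))
    return np.mean(grads, axis=0)

def nesterov_on_smoothed(x0, steps=200, lr=1e-3, gamma=1e-2):
    """Nesterov's accelerated gradient applied to the smoothed surrogate."""
    x, y = x0.copy(), x0.copy()
    for t in range(1, steps + 1):
        x_next = y - lr * smoothed_gradient(y, gamma)
        momentum = (t - 1) / (t + 2)
        y = x_next + momentum * (x_next - x)
        x = x_next
    return x

x_star = nesterov_on_smoothed(np.zeros(20))
print(f"final objective: {f(x_star):.4f}")
```

Smoothing trades a small bias (controlled by gamma) for a differentiable surrogate on which acceleration applies, which is the mechanism the abstract credits for accelerating the depth-dependent part of the rate.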

Mon Jul 03 2017
Machine Learning
Parle: parallelizing stochastic gradient descent
We propose a new algorithm called Parle for parallel training of deep networks. It converges 2-4x faster than a data-parallel implementation of SGD. Parle requires very infrequent communication with the parameter server.
Sat May 07 2016
Machine Learning
Distributed stochastic optimization for deep learning (thesis)
Elastic Averaging SGD (EASGD) is a new distributed stochastic optimization method. EASGD is applied to train deep convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Our approach accelerates training and achieves better test accuracy.
Wed Dec 09 2015
Machine Learning
Efficient Distributed SGD with Variance Reduction
Stochastic Gradient Descent (SGD) has become one of the most popular optimization methods for training machine learning models on massive datasets. SGD suffers from two main drawbacks; one is that the noisy gradient updates have high variance, which slows down convergence as the iterates approach the optimum.
Tue Nov 25 2014
Machine Learning
Accelerated Parallel Optimization Methods for Large Scale Machine Learning
The growing amount of high-dimensional data in different machine learning applications requires more efficient and scalable optimization algorithms. In this work, we consider combining two techniques: parallelism and Nesterov's acceleration.
Fri Nov 08 2019
Machine Learning
MindTheStep-AsyncPSGD: Adaptive Asynchronous Parallel Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is widely used in large-scale optimization problems. Asynchronous, parallel SGD has received particular attention. We propose new models for capturing the nature of the staleness distribution in a practical setting.
Fri Feb 16 2018
Machine Learning
Distributed Stochastic Optimization via Adaptive SGD
Stochastic convex optimization algorithms are the most popular way to train machine learning models on large-scale data. The most widely used of these, Stochastic Gradient Descent (SGD), is a serial method that is surprisingly hard to parallelize.