Published on Thu Oct 11 2018

signSGD with Majority Vote is Communication Efficient And Fault Tolerant

Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, Anima Anandkumar

Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD.

Abstract

Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that, unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. The class of adversaries we consider includes as special cases those that invert or randomise their gradient estimate. On the practical side, we built our distributed training system in PyTorch. Benchmarking against the state-of-the-art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in time for training ResNet-50 on ImageNet when using 15 AWS p3.2xlarge machines.
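To make the voting step concrete, here is a minimal single-process sketch of one signSGD-with-majority-vote update in Python. It is an illustration of the scheme described above, not the authors' implementation; the function name and arguments are invented, and in a real deployment each worker would transmit torch.sign(grad) (one bit per coordinate) over the network rather than the full tensor.

    import torch

    def majority_vote_step(param, worker_grads, lr=1e-3):
        # Each worker communicates only the sign of its gradient estimate.
        sign_votes = torch.stack([torch.sign(g) for g in worker_grads])
        # The server tallies the votes and keeps only the majority sign
        # per coordinate.
        majority = torch.sign(sign_votes.sum(dim=0))
        # Every worker applies the same 1-bit update direction.
        param.data.add_(majority, alpha=-lr)

In this sketch each of the M workers uploads d sign bits and downloads d sign bits per iteration, instead of 32-bit floating-point gradient entries, which is where the communication saving comes from.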

Tue Feb 13 2018
Machine Learning
signSGD: Compressed Optimisation for Non-Convex Problems
signSGD is able to match the accuracy and convergence speed of Adam on deep ImageNet models. It gets the best of both worlds: compressed gradients and an SGD-level convergence rate.
Sun May 05 2019
Machine Learning
Fast and Robust Distributed Learning in High Dimension
When almost all workers are correct, multi-Bulyan reaches the speed of averaging, the fastest (but non-Byzantine-resilient) rule for distributed machine learning. Its parallelisability further adds to its efficiency.
Mon Nov 18 2019
Machine Learning
Fast Machine Learning with Byzantine Workers and Servers
Fri Jun 12 2020
Machine Learning
O(1) Communication for Distributed SGD through Two-Level Gradient Averaging
Sun Jan 27 2019
Machine Learning
99% of Distributed Optimization is a Waste of Time: The Issue and How to Fix it
Many popular distributed optimization methods for training machine learning models fit the following template. A local gradient estimate is computed independently by each worker, then communicated to a master. The average is broadcast back to the workers, which use it to perform a gradient-type step to update the local version of the model.
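As a rough illustration of that template, the sketch below simulates one round in a single Python process. The names (`distributed_sgd_round`, `workers`) are invented for the example; in an actual system the gradients and the averaged update would travel over the network.

    import numpy as np

    def distributed_sgd_round(x, workers, lr=0.1):
        # Each worker computes a local gradient estimate independently,
        # e.g. on its own data shard (workers[i] is a callable here).
        local_grads = [worker_grad(x) for worker_grad in workers]
        # The estimates are communicated to the master, which averages them.
        avg_grad = np.mean(local_grads, axis=0)
        # The average is broadcast back, and every worker performs the
        # same gradient-type step on its local copy of the model.
        return x - lr * avg_grad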
Thu Nov 21 2019
Machine Learning
Communication-Efficient and Byzantine-Robust Distributed Learning with Error Feedback
We develop a communication-efficient distributed learning algorithm that is robust against Byzantine worker machines. We analyze the compressed gradient descent algorithm with error feedback and show that, in a certain range of the compression factor, the (order-wise) rate of convergence is not affected by the compression operation.
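For context, here is a minimal sketch of the generic error-feedback mechanism referred to above (one worker's step). It is not the authors' exact algorithm: the compression operator and the scaled-sign choice are stand-ins for illustration, and Byzantine-robust aggregation at the server is omitted.

    import numpy as np

    def scaled_sign(v):
        # One common compression operator (assumed here for illustration):
        # the sign vector rescaled by the mean absolute value.
        return np.abs(v).mean() * np.sign(v)

    def error_feedback_step(x, grad, memory, lr=0.1, compress=scaled_sign):
        # Correct the fresh gradient with the error left over from
        # earlier compressions.
        corrected = lr * grad + memory
        # Compress the corrected update before communicating it.
        update = compress(corrected)
        # Remember what the compression discarded; it is fed back next step.
        memory = corrected - update
        # Apply the compressed update to the model.
        return x - update, memory

The error memory is what lets the compressed method recover the uncompressed rate in the regime the abstract describes: information thrown away by the compressor is re-injected in later iterations instead of being lost.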