Elastic Synchronization for Efficient and Effective Distributed Deep Learning

Date

2020-11-13

Authors

Zhao, Xing

Abstract

Training deep neural networks (DNNs) on a large-scale cluster with an efficient distributed paradigm significantly reduces training time. However, a distributed paradigm developed solely from a systems engineering perspective is likely to hinder the model from learning, owing to the intrinsic optimization properties of machine learning. In this thesis, we present two efficient and effective models for the parameter server setting that address the limitations of state-of-the-art distributed models such as stale synchronous parallel (SSP) and bulk synchronous parallel (BSP).
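To illustrate the distinction between the two baseline paradigms, the core SSP rule can be sketched as a simple clock check, with BSP as the zero-staleness special case (an illustrative sketch assuming an iteration-clock model; the function name and clocks are hypothetical, not the thesis implementation):

```python
def ssp_can_proceed(worker_clock: int, slowest_clock: int, staleness_s: int) -> bool:
    """A worker at iteration `worker_clock` may continue without waiting
    only if it is at most `staleness_s` iterations ahead of the slowest
    worker. BSP is the special case staleness_s == 0: every worker must
    wait at the barrier for all others."""
    return worker_clock - slowest_clock <= staleness_s

# With a fixed threshold s = 3, a worker 4 iterations ahead must block,
# while a worker exactly 3 iterations ahead may proceed.
print(ssp_can_proceed(10, 6, 3))  # False
print(ssp_can_proceed(10, 7, 3))  # True
```

Under SSP, fast workers therefore keep computing on slightly stale parameters instead of idling, which trades some statistical efficiency for hardware efficiency.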

We introduce the DynamicSSP model, which adds smart dynamic communication to SSP, improves its communication efficiency, and replaces its fixed staleness threshold with a dynamic one. DynamicSSP converges faster, and to higher accuracy, than SSP in heterogeneous environments. Having recognized the importance of bulk synchronization in training, we propose the ElasticBSP model, which combines the properties of bulk synchronization and elastic synchronization. We develop fast online optimization algorithms with a look-ahead mechanism to materialise ElasticBSP. Empirically, ElasticBSP achieves convergence 1.77 times faster and overall accuracy 12.6% higher than BSP.
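The look-ahead idea behind ElasticBSP can be sketched as choosing, within a bounded window of predicted future update arrivals, the barrier placement that minimizes total worker idle time (a minimal sketch under the assumption that per-worker finish times can be predicted; the function name, cost model, and prediction inputs are hypothetical, and the thesis's online optimization algorithms are more sophisticated):

```python
def best_barrier(predicted_times: list[list[float]], window: int) -> tuple[int, float]:
    """predicted_times[w][r] is worker w's predicted timestamp for
    finishing round r. Consider placing the synchronization barrier
    after each candidate round r within the look-ahead window, and
    pick the round that minimizes total waiting (idle) time."""
    horizon = min(window, min(len(t) for t in predicted_times))
    best_r, best_cost = 0, float("inf")
    for r in range(horizon):
        finishes = [t[r] for t in predicted_times]
        # Each worker idles from its own finish until the latest finish.
        cost = sum(max(finishes) - f for f in finishes)
        if cost < best_cost:
            best_r, best_cost = r, cost
    return best_r, best_cost

# Two workers: synchronizing after round 1 (both predicted to finish at
# time 4) incurs zero idle time, unlike rounds 0 or 2.
print(best_barrier([[2, 4, 7], [3, 4, 8]], window=3))  # (1, 0)
```

Deferring the barrier to a low-cost round is what makes the synchronization "elastic": workers still synchronize in bulk, but at moments when they are naturally aligned.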

Keywords

Computer science
