Elastic Synchronization for Efficient and Effective Distributed Deep Learning

Date

2020-11-13

Authors

Zhao, Xing

Abstract

Training deep neural networks (DNNs) on a large-scale cluster with an efficient distributed paradigm significantly reduces training time. However, a distributed paradigm developed solely from a systems engineering perspective is likely to hinder the model from learning, owing to the intrinsic optimization properties of machine learning. In this thesis, we present two efficient and effective models for the parameter server setting that address the limitations of state-of-the-art distributed models such as stale synchronous parallel (SSP) and bulk synchronous parallel (BSP).
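To illustrate the distinction between the two baseline paradigms, the core SSP rule can be sketched as a simple clock check, with BSP as the zero-staleness special case (an illustrative sketch assuming an iteration-clock model; the function name and clocks are hypothetical, not the thesis implementation):

```python
def ssp_can_proceed(worker_clock: int, slowest_clock: int, staleness_s: int) -> bool:
    """A worker at iteration `worker_clock` may continue without waiting
    only if it is at most `staleness_s` iterations ahead of the slowest
    worker. BSP is the special case staleness_s == 0: every worker must
    wait at the barrier for all others."""
    return worker_clock - slowest_clock <= staleness_s

# With a fixed threshold s = 3, a worker 4 iterations ahead must block,
# while a worker exactly 3 iterations ahead may proceed.
print(ssp_can_proceed(10, 6, 3))  # False
print(ssp_can_proceed(10, 7, 3))  # True
```

Under SSP, fast workers therefore keep computing on slightly stale parameters instead of idling, which trades some statistical efficiency for hardware efficiency.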

We introduce the DynamicSSP model, which adds smart dynamic communication to SSP, improves its communication efficiency, and replaces its fixed staleness threshold with a dynamic one. DynamicSSP converges faster, and to higher accuracy, than SSP in heterogeneous environments. Having recognized the importance of bulk synchronization in training, we propose the ElasticBSP model, which combines the properties of bulk synchronization and elastic synchronization. We develop fast online optimization algorithms with a look-ahead mechanism to materialise ElasticBSP. Empirically, ElasticBSP achieves convergence 1.77 times faster and overall accuracy 12.6% higher than BSP.
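The look-ahead idea behind ElasticBSP can be sketched as choosing, within a bounded window of predicted future update arrivals, the barrier placement that minimizes total worker idle time (a minimal sketch under the assumption that per-worker finish times can be predicted; the function name, cost model, and prediction inputs are hypothetical, and the thesis's online optimization algorithms are more sophisticated):

```python
def best_barrier(predicted_times: list[list[float]], window: int) -> tuple[int, float]:
    """predicted_times[w][r] is worker w's predicted timestamp for
    finishing round r. Consider placing the synchronization barrier
    after each candidate round r within the look-ahead window, and
    pick the round that minimizes total waiting (idle) time."""
    horizon = min(window, min(len(t) for t in predicted_times))
    best_r, best_cost = 0, float("inf")
    for r in range(horizon):
        finishes = [t[r] for t in predicted_times]
        # Each worker idles from its own finish until the latest finish.
        cost = sum(max(finishes) - f for f in finishes)
        if cost < best_cost:
            best_r, best_cost = r, cost
    return best_r, best_cost

# Two workers: synchronizing after round 1 (both predicted to finish at
# time 4) incurs zero idle time, unlike rounds 0 or 2.
print(best_barrier([[2, 4, 7], [3, 4, 8]], window=3))  # (1, 0)
```

Deferring the barrier to a low-cost round is what makes the synchronization "elastic": workers still synchronize in bulk, but at moments when they are naturally aligned.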

Keywords

Computer science
