Authors: An, Aijun; Zhao, Xing
Issued: 2020-07
Available: 2020-11-13
URI: http://hdl.handle.net/10315/37937
Title: Elastic Synchronization for Efficient and Effective Distributed Deep Learning
Type: Electronic Thesis or Dissertation
Subject: Computer science
Keywords: Distributed Deep Learning; BSP; ASP; SSP; SGD; Optimization
Rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.

Abstract: Training deep neural networks (DNNs) on a large-scale cluster with an efficient distributed paradigm significantly reduces the training time. However, a distributed paradigm developed only from a system engineering perspective is likely to hinder the model from learning, owing to the intrinsic optimization properties of machine learning. In this thesis, we present two efficient and effective models in the parameter server setting that address the limitations of state-of-the-art distributed models such as stale synchronous parallel (SSP) and bulk synchronous parallel (BSP). We introduce the DynamicSSP model, which adds smart dynamic communication to SSP, improving its communication efficiency and replacing its fixed staleness threshold with a dynamic one. DynamicSSP converges faster, and to a higher accuracy, than SSP in heterogeneous environments. Having recognized the importance of bulk synchronization in training, we propose the ElasticBSP model, which combines the properties of bulk synchronization and elastic synchronization. We develop fast online optimization algorithms with look-ahead mechanisms to materialise ElasticBSP. Empirically, ElasticBSP converges 1.77 times faster than BSP and achieves an overall accuracy 12.6% higher.
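For illustration only: the abstract's SSP and DynamicSSP discussion centers on a staleness threshold that bounds how far the fastest worker may run ahead of the slowest. The minimal Python sketch below shows that wait rule with a threshold that can be adjusted at run time. It is not the thesis's implementation; the class and method names (StalenessController, may_proceed, update_threshold) and the adjustment rule are assumptions made for this example.

# Minimal sketch of the SSP wait rule with an adjustable staleness threshold.
# Illustrative only; not the DynamicSSP algorithm from the thesis.

class StalenessController:
    def __init__(self, initial_threshold=3):
        self.threshold = initial_threshold   # allowed clock gap between fastest and slowest worker
        self.worker_clocks = {}              # worker id -> number of completed iterations

    def report_clock(self, worker_id, clock):
        """Record that a worker has finished `clock` iterations."""
        self.worker_clocks[worker_id] = clock

    def may_proceed(self, worker_id):
        """SSP rule: a worker may start its next iteration only if it is at most
        `threshold` clocks ahead of the slowest worker."""
        if not self.worker_clocks:
            return True
        slowest = min(self.worker_clocks.values())
        return self.worker_clocks.get(worker_id, 0) - slowest <= self.threshold

    def update_threshold(self, observed_straggler_gap):
        """Hypothetical dynamic adjustment: widen the threshold when stragglers lag
        far behind, tighten it when workers are nearly in step (bounded to [1, 8])."""
        self.threshold = max(1, min(observed_straggler_gap, 8))

# Example usage: three workers report their clocks; the fastest one is blocked
# because its lead over the slowest worker exceeds the current threshold.
ctrl = StalenessController(initial_threshold=2)
ctrl.report_clock("w0", 5)
ctrl.report_clock("w1", 3)
ctrl.report_clock("w2", 6)
print(ctrl.may_proceed("w2"))   # False: gap of 6 - 3 = 3 exceeds the threshold of 2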